The Curse of Dimensionality and Decision Theory

The curse of dimensionality

Consider a dataset created by aggregating measurements taken from a pipeline containing a mixture of oil, water, and gas (Bishop & James, 1993). These three materials can exist in one of three different geometric configurations called ‘homogeneous’, ‘annular’, and ’laminar’, and the ratio of these materials can also change. Each data point consists of a 12-dimensional input vector. $Fig. 1$ shows 100 points from this dataset on a graph illustrating two measurements, $x_6$ and $x_7$ (the other 10 input values are omitted for clarity). Each data point is labeled according to the geometric class it belongs to, and our goal is to use this dataset as a training set to classify a new observation $(x_6, x_7)$, such as the point marked with a cross in $Fig. 1$.

We observe that the cross is surrounded by many red points, so we might assume it belongs to the red class. However, there are also many green points nearby, so it could be assigned to the green class, and it is unlikely to belong to the blue class. Intuitively, we see that the class of the cross should be determined primarily by the data points closer to it rather than those farther away.

A simple approach to implement this intuition into an algorithm is to divide the input space into uniform square cells, as shown in $Fig. 2$. To determine the class of a point, we decide which cell it belongs to, and then find all training points that fall into the same cell. The class of the point to be predicted is the class that contains the most training points within that cell.

There are several issues with this simple approach, and one of the most serious becomes apparent when we consider extending it to problems with a larger number of input variables, corresponding to a higher-dimensional input space. The origin of the problem is illustrated in $Fig. 3$, showing that, if we divide a region of space into uniform square cells, the number of these cells increases exponentially with the dimensionality of the space.

The problem with an exponentially large number of cells is that we would need an exponentially large amount of training data to ensure that these cells are not empty. Clearly, we have little hope of applying this technique in higher-dimensional spaces, and thus we need to find an alternative approach.

Returning to the example of polynomial curve fitting (see previous post) and considering how we extend this approach to handle input spaces with many variables, if we have $D$ input variables, a general polynomial with cubic coefficients will have the form:

$$y(x, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{ijk} x_i x_j x_k.$$

As $D$ increases, the number of independent coefficients (not all coefficients are independent due to symmetry between the variables $\mathbf{x}$) grows at a rate of $D^3$. In practice, to capture the complex dependencies in the data, we may need to use a higher-degree polynomial. For a polynomial of degree $M$, the number of coefficients increases as $D^M$. Although this is exponential growth, it still indicates that this method becomes impractical.

Geometric intuitions formed in three-dimensional space can be misleading when considering higher-dimensional spaces. A simple example, consider a sphere with radius $r = 1$ in $D$-dimensional space, and ask the ratio of the volume of the sphere between radii $r = 1-\epsilon$ and $r = 1$. We can evaluate this ratio by noting that the volume of a sphere with radius $r$ in $D$-dimensional space should scale with $r^D$, so we have: $$V_D(r) = K_D r^D$$ where the constant $K_D$ depends only on $D$. Thus, the ratio is given by: $$\frac{V_D(1) - V_D(1-\epsilon)}{V_D(1)} = 1 - (1-\epsilon)^D$$

The graph of this function for different values of $D$ is shown in $Fig. 4$. We see that for large $D$, this ratio approaches 1 even for small values of $\epsilon$. Therefore, in high-dimensional space, most of the volume of the sphere is concentrated in a thin shell near the surface. This contrasts with our intuition in the usual three-dimensional (3D) space, where the spherical volume is distributed evenly throughout the entire interior.

Another example, directly related to pattern recognition, is the behavior of a Gaussian distribution in multidimensional space. If we transform from Cartesian coordinates to polar coordinates, and then integrate over the angular variables, we obtain an expression for the density $\rho(r)$ as a function of the radial distance $r$ from the origin. Thus, $\rho(r) \delta r$ represents the probability mass within a thin shell of thickness $\delta r$ at radius $r$. This distribution is plotted for different values of $D$ in $Fig. 5$, and we see that for large $D$, the probability mass of the Gaussian concentrates in a thin shell near the surface, rather than being evenly distributed around the center as we might expect in low-dimensional space. Specifically, in low-dimensional space (e.g., $D = 1$ or $D = 2$), the Gaussian distribution typically has its probability mass concentrated close to the center (here $r=0$). But as the dimensionality $D$ increases, the Gaussian’s probability mass tends to “move away” from the center and concentrate in a region of larger radius, resulting in the probability being primarily distributed at a specific radius value.

These difficulties are known as the Curse of Dimensionality (CoD), a phenomenon that arises when the input space has a high number of dimensions, leading to several challenges in data analysis and processing. To summarize, the issues related to CoD include:

The number of dimensions ($D$) increases the need for much larger amounts of training data to ensure the space is sufficiently covered, increasing complexity and computational costs.
The number of coefficients required in mathematical models, such as polynomials, increases rapidly as $D$ grows, making these models more complex and difficult to apply.
As $D$ increases, the volume or probability mass concentrates more at the boundaries and no longer distributes evenly as in low-dimensional spaces, leading to difficulties (in machine learning or statistics) in handling the data, reducing the effectiveness of many distance- or density-based methods.
As $D$ increases, the distance between data points increases, but the discriminability between data points decreases. When the distances between points become large and relatively uniform, traditional machine learning or statistical models (e.g., regression, SVM, KNN) may not perform effectively because they rely on clear distinctions between data points to learn or classify, but when the distances become relatively uniform, models struggle to find the correct classification boundaries.
As $D$ increases, the space becomes “sparser.” That is, if the number of data points remains constant, the point density in the space drops significantly. This makes it harder to build data-driven models as the model has fewer data points to analyze and predict, reducing its ability to differentiate between different data groups or classes.

Decision Theory

1. Introduction

Probability theory provides a consistent mathematical framework for quantifying and handling uncertainty. In this section, we will discuss decision theory, which, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty, such as in pattern recognition.

Suppose we have a medical diagnosis problem where we have taken an X-ray of a patient and want to determine whether the patient has cancer or not. In this case, the input vector $\mathbf{x}$ is the set of pixel intensities in the image, and the target variable $t$ represents the presence of cancer (denoted as class $C_1$) or no cancer (denoted as class $C_2$).

We can choose $t$ as a binary variable such that $t = 0$ corresponds to class $C_1$ and $t = 1$ corresponds to class $C_2$. The general inference problem involves determining the joint distribution $p(\mathbf{x}, C_k)$, which is equivalent to $p(\mathbf{x}, t)$, because it summarizes all the uncertainty (uncertainty) here related to the two variables $\mathbf{x}$ and $\mathbf{t}$. This is because the joint probability distribution $p(\mathbf{x}, \mathbf{t})$ describes the probability of all possible pairs of values for $\mathbf{x}$ and $\mathbf{t}$. This includes all the information about how the input variables $\mathbf{x}$ and the target variable $\mathbf{t}$ interact and depend on each other. Additionally, from $p(\mathbf{x}, \mathbf{t})$, we can also find the marginal distributions $p(\mathbf{x})$ and $p(\mathbf{t})$, along with the conditional distribution $p(\mathbf{t}|\mathbf{x})$.

While this joint distribution is a very useful quantity, in practice, we must make a specific prediction about the value of $t$, or more generally, perform a specific action based on our understanding of the possible values that $t$ could take. This is the subject of decision theory.

2. Minimizing the misclassification rate

Suppose our goal is to minimize the number of misclassifications as much as possible. We need a decision rule that divides the input space into decision regions $R_k$ for each class $C_k$, so that every point in $R_k$ is assigned to class $C_k$. The probability of making a mistake occurs when a point in region $R_1$ is assigned to class $C_2$ or vice versa, and is given by:

$$ p(\text{mistake}) = \int_{R_1} p(x, C_2) , dx + \int_{R_2} p(x, C_1) , dx $$

We will choose the decision rule so that each $x$ is assigned to the class with the largest posterior probability $p(C_k|x)$. This is illustrated in $Fig. 6$, where the regions $R_1$ and $R_2$ are determined by the decision boundary $\hat{x}$. As the position of this boundary changes, the area of the red region (representing the error) will change. The optimal position of $\hat{x}$ is the point where the two curves $p(x, C_1)$ and $p(x, C_2)$ intersect.

3. Minimizing the expected loss

In many applications, the objective is more complex than simply minimizing the number of misclassifications. For example, in medical diagnosis, if a healthy person is classified as having a disease, the situation is certainly not as serious as if a diseased person is misclassified as healthy, since the latter could be life-threatening. Clearly, we need to reduce the second type of misclassification, even if it means that the first type of misclassification may increase.

To handle these issues, we use a loss function $L_{kj}$, which measures the degree of loss when assigning $x$ to class $C_j$ when the true class is $C_k$. For example, in the cancer diagnosis problem, the loss matrix may look as follows:

$$ \begin{pmatrix} \text{gt/pred} & \text{cancer} & \text{normal} \\ \text{cancer} & 0 & 1000 \\ \text{normal} & 1 & 0 \end{pmatrix} $$

This matrix shows that no loss occurs when the classes are correctly classified, and the loss value is $1000$ if a diseased person is classified as healthy. The loss value for misclassifying a healthy person as diseased is $1$. It is reasonable to impose a heavy penalty for the second type of misclassification, which we consider extremely dangerous and should be minimized at all costs.

The goal now is to minimize the expected loss:

$$ \mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R_j}} L_{kj} p(x, C_k) , dx \tag{1} $$

Each $x$ will be independently assigned to the decision region $\mathcal{R_j}$. The optimal decision rule will be one that selects the region $\mathcal{R_j}$ to minimize $(1)$, which means that for each $x$, we need to minimize $\sum_k \int_{\mathcal{R_j}} L_{kj} p(x, C_k)$. Additionally, since $p(x, C_k) = p(C_k|x)p(x)$, and $p(x)$ is a common factor for all $j$, it can be factored out. Therefore, the loss function we need to minimize becomes:

$$ \mathbb{E}[L] = \sum_k L_{kj} p(C_k|x) \tag{2} $$

4. The reject option

In some applications, we may want to avoid making decisions in cases where the input is unclear. This can be achieved by introducing a threshold $\theta$ and rejecting input $x$ when the largest posterior probability $p(C_k|x)$ is less than or equal to $\theta$. This helps reduce the error rate in decision-making when uncertainty is high. $Fig. 7$ illustrates how a threshold $\theta$ can be used to define rejection regions.

5. Inference and decision

The process of solving a classification problem can be divided into two steps: the inference step, where we determine the model $p(x, t)$, and the decision step, where we use posterior probabilities to make an optimal decision.

There are three main approaches to solving decision problems:

$(a)$ Solve the inference problem by determining the conditional density of each class $p(x|C_k)$ for each class $C_k$, then use Bayes’ theorem to find the posterior probability $p(C_k|x)$:

$$ p(C_k|x) = \frac{p(x|C_k) p(C_k)}{p(x)} \tag{3} $$

$(b)$ Directly determine the posterior probability $p(C_k|x)$, then use decision theory to assign $x$ to one of the classes.
$(c)$ Find a function $f(x)$, called a discriminant function, to directly map each $x$ to a class label.

Approach $(a)$ is the most complex because it requires determining the joint distribution over both $x$ and $C_k$. However, this approach also allows us to determine the marginal density $p(x)$, which can be useful for detecting new data points with low probability under the model. But if we are only concerned with classification, finding the joint distribution is costly and redundant. Instead, approach $(b)$ is sufficient for classification tasks.

Approach $(c)$ combines both inference and decision in a single step, where we use the function $f(x)$ to directly map input $x$ to class $C_k$. In the example from $Fig. 8$, this corresponds to finding the value of $x$ represented by the green line, as this is the decision boundary for the lowest misclassification rate.

However, with this approach, we do not know the posterior probabilities $p(C_k|x)$. There are several reasons why we might need to compute posterior probabilities, even if we later use them to make decisions. These reasons include:

Minimizing risk: Consider a problem where the elements of the loss matrix change over time. If we know the posterior probabilities, we can easily adjust the decision rule to minimize risk by modifying equation $(2)$ accordingly. If we only have a discriminant function, any changes to the loss matrix would require us to go back to the training data and solve the classification problem again.
Reject option: The posterior probabilities allow us to define a rejection criterion, which will minimize the misclassification rate, or more generally, the expected loss, for a specific portion of the data points being rejected.
Compensating for prior class probabilities: In imbalanced class problems (such as cancer diagnosis tasks), using posterior probabilities allows us to adjust predictions based on the actual class distribution in the dataset. Consider the medical X-ray problem, where we assume a large number of X-ray images have been collected from the general population to build an automatic screening system. Since cancer is rare in the general population, we may find that only 1 in 1000 examples represents cancer. If we use such a dataset to train an adaptive model, we might face significant difficulties due to the small proportion of cancer cases. For example, a classifier that assigns all points to the normal class would achieve 99.9% accuracy and would be hard to avoid as a trivial solution. Additionally, even a large dataset will contain very few cancer-related X-ray images, so the learning algorithm may not generalize well. A balanced dataset with equal numbers of examples from each class will allow us to find a more accurate model. However, we must then compensate for the effects of the adjustments made to the training data. Suppose we have used such a modified dataset and found models for the posterior probabilities. From Bayes’ theorem $(3)$, we see that the posterior probabilities are proportional to the prior probabilities, which we can interpret as the ratio of points in each class. Thus, we can simply take the posterior probabilities obtained from our balanced dataset, divide them by the class proportions in that dataset, and then multiply by the class proportions in the dataset we want to apply the model to. Finally, we need to normalize to ensure that the new posterior probabilities sum to 1. Note that this procedure cannot be applied if we have learned a discriminant function instead of determining the posterior probabilities.
Combining models: In complex problems with multiple data sources, using the Naive Bayes model and conditional independence allows us to efficiently combine information from different sources while maintaining prediction accuracy. Specifically, for complex applications, we may want to break the problem down into smaller sub-problems, each of which can be solved by a separate module. For example, in our medical diagnosis problem, we might have information from both blood tests and X-ray images. Instead of combining all this heterogeneous information into a large input space, it may be more efficient to build one system to interpret the X-ray images and another system to interpret the blood data. As long as each of these models provides posterior probabilities for the classes, we can combine their outputs systematically using probabilistic rules. A simple way to do this is to assume that, for each class, the input distributions for the X-ray image data, denoted as $x_1$, and the blood data, denoted as $x_B$, are independent, so:

$$ p(x_1, x_B | C_k) = p(x_1 | C_k) p(x_B | C_k) \tag{4} $$

This is an example of conditional independence, as this independence holds when the distribution is conditioned on class $C_k$. The posterior probability, with both X-ray and blood data, is then given by:

$$ p(C_k | x_1, x_B) \propto p(x_1, x_B | C_k) p(C_k) $$

$$ \propto p(x_1 | C_k) p(x_B | C_k) p(C_k) $$

$$ \propto \frac{p(C_k | x_1) p(C_k | x_B)}{p(C_k)} $$

Therefore, we need the prior probabilities of the class $p(C_k)$, which we can easily estimate from the proportions of the data points in each class, and then we need to normalize the resulting posterior probabilities so that they sum to 1. The specific conditional independence assumption $(4)$ is an example of a Naive Bayes model. Note that the marginal distribution $p(x_1, x_B)$ will generally not factorize under this model. In later chapters, we will see how to build models that combine data without requiring the conditional independence assumption $(4)$.

In general, using posterior probabilities not only provides a powerful tool for making accurate decisions but also allows flexibility in handling complex and time-varying situations.

6. Loss functions for regression

The regression function $y(x)$ is the conditional expectation of $t$ given $x$, and it is also called the regression function.
In the illustration example $Fig. 9$, this graph shows that when $x = x_0$, the value $y(x_0)$ is determined by the mean value of the conditional distribution $p(t|x_0)$, which is represented by the dashed green horizontal line.
The regression function optimizes squared loss expectation by averaging the values of $t$ according to the conditional distribution $p(t|x)$.

The decision-making process involves predicting the actual value $y(x)$ corresponding to the output $t$ of the input variable $x$. The expected loss function over all the random variables $x$ and $t$:

$$ \mathbb{E}[L] = \iint L(t, y(x))p(x, t) , dx , dt $$

In the specific case of squared loss:

$$ \mathbb{E}[L] = \iint {y(x) - t}^2 p(x, t) , dx , dt $$

Solving the differential equation to minimize $\mathbb{E}[L]$, we get:

$$ y(x) = \frac{\int t p(x, t) , dt}{p(x)} = E_t[t|x] $$

This shows that the regression function $y(x)$ is the conditional mean of $t$ given $x$.

Minkowski loss function

Squared loss is not the only choice for regression problems. A generalized case of squared loss is called Minkowski loss, with the expectation given by:

$$ \mathbb{E}[L_q] = \iint |y(x) - t|^q p(x, t) , dx , dt $$

This loss function shrinks to squared loss when $q = 2$, and the minimum value of $\mathbb{E}[L_q]$ is given by the conditional mean when $q = 2$, the conditional median when $q = 1$, and the conditional mode when $q \to 0$.

Approaches for regression

$(a)$ Solve the inference problem to determine the joint density $p(x, t)$, then normalize to find the conditional density $p(t|x)$, and finally compute the conditional expectation using the formula:

$$ y(x) = E_t[t|x] $$

$(b)$ Solve the inference problem to find the conditional density $p(t|x)$, then integrate to find the conditional expectation.
$(c)$ Directly find the regression function $y(x)$ from the training data.

Each approach has its own advantages and disadvantages, depending on the complexity of the specific problem.