1. Problem Description
In machine learning, we are often given a dataset $x$, along with an observation set $t$. Our task is to predict the corresponding value $t_j$ as accurately as possible for each new $x_j$.
Consider a specific example: we have a training dataset with $N$ data points $\mathbf{x} \equiv (x_1, \ldots, x_N)^T$, along with an observation set $\mathbf{t} \equiv (t_1, \ldots, t_N)^T$. Each point $x_n$ with $n \in [1, \ldots, N]$ is generated from a uniform distribution in the range $[0, 1]$, and the corresponding $t_n$ is calculated as $t_n = 2 \pi x_n + \epsilon$, where $\epsilon$ is a small Gaussian-distributed random noise. Adding such noise ensures a characteristic of real-world datasets: data points generally follow a common pattern (which is what we want the machine to learn), but each individual data point will be perturbed by random variables.
Now, our goal is to approximate $t_t$ when a new data point $x_t$ is introduced, meaning that our model must closely approximate the function $2 \pi x$.
2. Polynomial Function Approximation Method
One of the simplest and most popular methods to solve this problem is to find a polynomial function $f(x) = t$ to approximate the relationship between $x$ and $t$.
Polynomial curve fitting is a regression analysis method to find the best-fitting polynomial function for a set of data points. Why use a polynomial function? In reality, any function can be approximated by a polynomial function, with increasing accuracy as the degree of the polynomial increases. That is, with a sufficient number of degrees, a polynomial function can closely approximate any type of nonlinear relationship between variables in the data. We can approximate by introducing a polynomial function as follows:
$$ \begin{align*} f(x, \mathbf{w}) &= w_0 + w_1x^1 + w_2x^2 + \ldots + w_Nx^N \\ &= \sum_{j=0}^N w_jx^j \end{align*} $$
where $N$ is the degree of the polynomial, $x^j$ is the power of $x$, and $w_0, \ldots, w_N$ are the polynomial coefficients. Although the function $f(x, w)$ is nonlinear in terms of $x$, it is linear with respect to $w$. This leads to a property of $f(x, w)$, which is the linear model characteristic, typical of polynomial functions that are linear in hidden variables (in this case, the $w$ variables).
To find the values of these hidden variables, we fit the function $f(x, w)$ to the training dataset and then use an error function like MSE to minimize the difference between the approximation result of $f(x, w)$ and the corresponding observed variable $t$. The MSE error function in this case has the form:
$$ E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N(f(x_n, \mathbf{w}) - t_n)^2 \tag{2.1} $$
where the factor $\frac{1}{2}$ is for convenience in calculation. The error function reaches its minimum if and only if the function $f(x, w)$ accurately approximates all data points in the training set. Since the error function is a quadratic function with respect to the coefficients $w$, its derivative with respect to these coefficients is a linear function of $w$. Indeed, the partial derivative of $E(x)$ with respect to $w_j$ is written as:
$$ \frac{\partial E(\mathbf{w})}{\partial w_j} = \frac{\partial}{\partial{w_j}} (\frac{1}{2}\sum_{n=1}^N(f(x_n,\mathbf{w})−t_n)^2) $$
Applying the chain rule, we get:
$$ \begin{align*} \frac{\partial E(\mathbf{w})}{\partial w_j} &= \sum_{n=1}^N(f(x_n,\mathbf{w})−t_n) x^j \tag{2.3} \end{align*} $$
This means that the partial derivative of the error function with respect to the coefficient $w_j$ is a linear function.
Moreover, minimizing the error function $(2.1)$ ensures a unique solution. The error function $E(\mathbf{w})$ is quadratic with respect to the coefficients $\mathbf{w}$, creating a paraboloid surface in multi-dimensional space. This leads to two important consequences:
- Shape of the function: A quadratic function has a parabolic surface as its graph. In multi-dimensional space, this implies that the error function’s graph will be a downward-facing paraboloid (for a global minimum). This shape guarantees a unique minimum point.
- Uniqueness of solution: The first derivative of the quadratic function with respect to $\mathbf{w}$ is a linear function. When we solve the equation $\nabla E(\mathbf{w}) = 0$, we are solving a system of linear equations. This system has a unique solution if the coefficient matrix (created by the first derivatives) is square and non-singular, which is typically the case in linear regression problems (assuming no linear dependence among input variables). This ensures that the error function has a single minimum point, giving a unique solution to the optimization problem.
In summary, we are looking for $\mathbf{w}$ such that $\nabla E(\mathbf{w}) = 0$, meaning the error function reaches its minimum. We can find a closed-form equation for $\mathbf{w}$, whose solution is unique, denoted $\mathbf{w}^*$.
The function $f(x, \mathbf{w})$ is represented by the matrix $\mathbf{X}\mathbf{w}$, where:
$$ \mathbf{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^N \\ 1 & x_2 & x_2^2 & \cdots & x_2^N \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^N \\ \end{bmatrix} $$ $$ \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} $$
and $t$ is represented by the vector:
$$ t = \begin{bmatrix} t_0 \\ t_1 \\ t_2 \\ \vdots \\ t_n \end{bmatrix} $$
From here, equation $(2.3)$ can be expressed as:
$$ \begin{align*} & \mathbf{X}^T(\mathbf{X}\mathbf{w}−t)=0 \\ \Leftrightarrow & \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^Tt \end{align*} $$
Thus, the unique solution $\mathbf{w}^*$ is found through the closed-form equation:
$$ \mathbf{w}^*=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Tt $$
The remaining issue is to choose the degree $N$ so that the function $f(x_n, \mathbf{w})$ fits the original dataset.

Fig.1 Comparison of polynomial curves with different degrees M, represented by red curves, in fitting the original dataset
Fig.1 provides a comparison of the function $f(x_n, \mathbf{w})$ at various degrees. With too low a degree, such as 0 or 1, the curve of $f(x_n, \mathbf{w})$ cannot fit well. A degree of 3 appears to fit well and perfectly. A degree of 9 produces a curve that fits all data points exactly, which may seem ideal, but in practice, such a close fit to training points leads to overfitting, a common problem in machine learning that we aim to avoid. Therefore, a higher degree is not always better; we need to choose a degree that provides a good fit without overfitting.
3. Method Using Probability Distributions
We have seen that the problem of polynomial curve fitting can be represented using error minimization methods. Here, we return to the curve fitting example and view it from a probabilistic perspective, gaining deeper insights into error functions and regularization, leading to a full Bayesian approach.
The goal in the curve fitting problem is to predict the target value $t$ based on a new value of the input variable $x$, given a training dataset of $N$ input values $\mathbf{x} = (x_1, \ldots, x_N)^T$ and corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^T$. We can represent the uncertainty in the target variable’s value using a probability distribution. For this purpose, we assume that, given $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the polynomial curve value $y(x, \mathbf{w})$. Therefore, we have:
$$ \begin{align*} p(t|x, \mathbf{w}, \beta) = \mathcal{N} \left( t | y(x, \mathbf{w}), \beta^{-1} \right) \tag{3.1} \end{align*} $$
Here, for consistency with later chapters, we define a precision parameter $\beta$, which is the inverse of the variance of the distribution. This is illustrated schematically in Figure 1.16.
We now use the training data ${ \mathbf{x}, \mathbf{t} }$ to determine the unknown parameters $\mathbf{w}$ and $\beta$ using the maximum likelihood method. Assuming the data are independently drawn from distribution $(3.1)$, the likelihood function is given by:
$$ \begin{align*} p(\mathbf{t} | \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N} \left( t_n | y(x_n, \mathbf{w}), \beta^{-1} \right) \tag{3.2} \end{align*} $$
Similar to the case of a simple Gaussian distribution, it is more convenient to maximize the log of the likelihood function. Substituting the form of the Gaussian distribution, we obtain the log-likelihood function as:
$$ \begin{align*} \ln p(\mathbf{t} | \mathbf{x}, \mathbf{w}, \beta) = \frac{\beta}{2} \sum_{n=1}^{N} {y(x_n, \mathbf{w}) - t_n}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi) \tag{3.3} \end{align*} $$
Now, we consider determining the maximum likelihood estimate for the polynomial coefficients, denoted $\mathbf{w}_{ML}$, by maximizing $(3.3)$ with respect to $\mathbf{w}$. For this purpose, we can ignore the last two terms on the right-hand side of $(3.3)$ as they do not depend on $\mathbf{w}$. Furthermore, multiplying the log-likelihood by a positive constant does not change the maximum with respect to $\mathbf{w}$, so we can replace the coefficient $\beta/2$ with $1/2$. Finally, instead of maximizing the log-likelihood, we can minimize the negative log-likelihood. Thus, maximizing the likelihood is equivalent to minimizing the squared error function defined by (1.2), which appears as a consequence of maximizing the likelihood under the assumption of Gaussian noise.
We can also use maximum likelihood to determine the precision parameter $\beta$ of the conditional Gaussian distribution. Maximizing $(3.3)$ with respect to $\beta$ gives:
$$ \begin{align*} \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} { y(x_n, \mathbf{w}_{ML}) - t_n }^2 \tag{3.4} \end{align*} $$
Once we have determined the parameters $\mathbf{w}_{ML}$ and $\beta_{ML}$, we can predict new values of $x$. Since we now have a probabilistic model, these values are represented as a predictive distribution for $t$, rather than just a point estimate, obtained by substituting the maximum likelihood parameters into $(3.1)$:
$$ \begin{align*} p(t | x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N} (t | y(x, \mathbf{w}_{ML}), \beta^{-1}_{ML}) \tag{3.5} \end{align*} $$
Now we take a step closer to the Bayesian method by introducing a prior distribution for the polynomial coefficients $\mathbf{w}$. For simplicity, we consider a Gaussian distribution:
$$ \begin{align*} p(\mathbf{w} | \alpha) = \mathcal{N} \left( \mathbf{w} | \mathbf{0}, \alpha^{-1} \mathbf{I} \right) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp \left( -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \right) \tag{3.6} \end{align*} $$
For the multivariate Gaussian distribution, the general formula is:
$$ \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{k/2} |\mathbf{\Sigma}|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\mathrm{T} \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) $$
Here:
- Normalization constant: $\left( \frac{\alpha}{2\pi} \right)^{(M+1)/2}$
- Exponential term: $\exp \left( -\frac{\alpha}{2} \mathbf{w}^\mathrm{T} \mathbf{w} \right)$
Here, $\alpha$ is the precision of the distribution, and $M+1$ is the number of elements in vector $\mathbf{w}$ for a polynomial of degree $M$. Hyperparameters such as $\alpha$ control significant aspects of the model and are often determined through optimization methods, like maximizing the posterior (MAP) or using cross-validation techniques.
Using Bayes’ theorem, the posterior distribution for $\mathbf{w}$ is proportional to the product of the prior distribution and the likelihood function:
$$ \begin{align*} p(\mathbf{w} | \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} | \mathbf{x}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha) \end{align*} \tag{3.7} $$
Now we can determine $\mathbf{w}$ by finding the most likely value of $\mathbf{w}$ for the data, i.e., by maximizing the posterior distribution. This technique is called maximum a posteriori (MAP). Taking the negative log of $(3.7)$ and combining with $(3.3)$ and (3.6), we find that maximizing the posterior is given by minimizing:
$$ \begin{align*} \frac{\beta}{2} \sum_{n=1}^{N} {y(x_n, \mathbf{w}) - t_n}^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \tag{3.8} \end{align*} $$
Thus, we see that maximizing the posterior distribution is equivalent to minimizing the regularized squared error function encountered previously in the form of (1.4), with the regularization parameter given by $\lambda = \alpha / \beta$.
4. Polynomial Curve Fitting Using Bayesian Approach
Although we have included the prior distribution $p(w|\alpha)$, we are still only performing a point estimate of $w$, which is insufficient to be a Bayesian method. In a fully Bayesian approach, we should consistently apply the rules of probability summation and multiplication, which requires, as we will see shortly, that we integrate over all values of $w$. This marginalization lies at the core of Bayesian methods for pattern recognition. In the curve fitting problem, we are given training data $\mathbf{x}$ and $\mathbf{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. Therefore, we want to evaluate the predictive distribution $p(t|x, \mathbf{x}, \mathbf{t})$. Here, we will assume that the parameters $\alpha$ and $\beta$ are fixed and known (we will discuss how such parameters can be inferred from data within the Bayesian framework in subsequent discussions). A simple Bayesian approach merely corresponds to consistently applying the rules of summation and multiplication of probabilities, allowing the predictive distribution to be written as:
$$ p(t|x, \mathbf{x}, \mathbf{t}) = \int p(t|x, w) p(w|\mathbf{x}, \mathbf{t}) , dw \tag{4.1} $$
Here, $p(t|x, w)$ is given by $(3.1)$, and we have omitted the dependency on $\alpha$ and $\beta$ for notational simplicity. Here, $p(w|\mathbf{x}, \mathbf{t})$ is the posterior distribution over the parameters, and it can be obtained by normalizing the right side of $(3.7)$. We will see in Section 3.3 that, for problems such as the curve fitting example, the posterior distribution is Gaussian and can be evaluated analytically. Similarly, the integral in $(4.1)$ can also be performed analytically, resulting in the predictive distribution given by a Gaussian form:
$$ p(t|x, \mathbf{x}, \mathbf{t}) = \mathcal{N} \left( t | m(x), s^2(x) \right) \tag{4.2} $$
with the mean and variance given by:
$$ m(x) = \beta \phi(x)^T S \sum_{n=1}^N \phi(x_n) t_n \tag{4.3} $$
$$ s^2(x) = \beta^{-1} + \phi(x)^T S \phi(x) \tag{4.4} $$
Here, the matrix $S$ is given by
$$ S^{-1} = \alpha I + \beta \sum_{n=1}^N \phi(x_n) \phi(x_n)^T $$
where $I$ is the identity matrix, and we define the vector $\phi(x)$ with elements $\phi_i(x) = x^i$ for $i = 0, \ldots, M$.
We see that both the variance and the mean of the predictive distribution in $(4.2)$ depend on $x$.
The first term in $(4.4)$ represents the uncertainty in the predicted value of $t$ due to noise in the target variables and was previously represented in the maximum likelihood predictive distribution $(3.5)$ through $\beta^{-1}$.
However, the second term arises from the uncertainty in the parameters $w$ and is a consequence of the Bayesian approach.
The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.3. Fig.1.3 Comparison of polynomial graphs with different degrees M, represented by red curves, in fitting the original dataset
5. Analysis of Differences in Solving the Curve Fitting Problem Using Polynomial Function with MSE Loss and Bayesian Perspective:
1. Using Polynomial Function with MSE Loss:
Method:
- Objective function: Uses the Mean Squared Error (MSE) loss function, defined as: $$\text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y_i})^2$$ where $y_i$ is the actual value, $\hat{y_i}$ is the predicted value, and $N$ is the number of data points.
- Parameter Estimation: Through minimizing the MSE loss function. This is often done using gradient descent or least squares solutions.
Advantages:
- Simple and Intuitive: The use of MSE is direct and easy to understand.
- Fast Computation: This approach usually requires less computation and can use simple analytical solutions.
- Widely Used: Suitable for many basic fitting problems and is widely applied across fields.
Disadvantages:
- Ignores Uncertainty: MSE only minimizes the average error without accounting for uncertainty in predictions.
- Overfitting: Using high-degree polynomials makes the model prone to overfitting the training data, reducing generalization ability.
- No Prior Information: This approach does not incorporate any prior information about parameters or data.
2. Using Bayesian Perspective:
Method:
- Prior Distribution: Uses a prior distribution $p(w)$ for the parameters $w$ before observing data.
- Likelihood: Probability of observed data for a specific set of parameters $p(t|x, w)$.
- Posterior Distribution: Uses Bayes’ theorem to update the prior distribution to the posterior distribution $p(w|x, t)$: $$p(w|x, t) \propto p(w) p(t|x, w)$$
- Predictive Distribution: Integrates the posterior distribution with the likelihood to find the predictive distribution for a new data point: $$p(t|x, \mathbf{x}, \mathbf{t}) = \int p(t|x, w) p(w|\mathbf{x}, \mathbf{t}) , dw$$
Advantages:
- Accounts for Uncertainty: Bayesian method considers uncertainty in both data and parameters, providing a probability distribution rather than a point estimate.
- Better Generalization Ability: Using prior and posterior distributions, Bayesian methods often generalize better, especially with limited or noisy data.
- Incorporates Prior Information: Allows the combination of prior information and experience into the model.
Disadvantages:
- More Complex Calculations: Calculating the posterior distribution and integrating to find the predictive distribution requires more computational resources.
- Challenging Prior Selection: Choosing an appropriate prior distribution can be difficult and significantly affects the results.
Conclusion:
- Polynomial Function with MSE Loss is a simple and intuitive method, suitable for basic fitting problems, but is prone to overfitting and ignores uncertainty.
- The Bayesian approach is more complex but provides a more complete approach by accounting for uncertainty and using prior information, which often leads to more accurate and generalizable results.