Bayesian Optimization: A Comprehensive Guide

This article provides a comprehensive introduction to Bayesian Optimization, combining insights from several research papers and blogs:

A Survey on Bayesian Optimization: https://doi.org/10.13328/j.cnki.jos.005607
Visualizing Gaussian Processes: https://jgoertler.com/visual-exploration-gaussian-processes/
Bayesian Optimization: http://arxiv.org/abs/1012.2599

Introduction

Problem 1: If we have a function y = x², it's straightforward to find its minimum at x = 0. However, if we only know that a function y = f(x) exists (without its specific expression), how can we find its minimum?

Problem 2: Machine learning and deep learning models are governed by numerous hyperparameters (e.g., learning rate, network depth). If we evaluate our model using R², how can we select hyperparameter values to maximize R²?

Should we use Grid Search, Random Search, or Bayesian Optimization?

This article focuses on Bayesian Optimization, a method for global optimization that works within a limited number of function evaluations. We state the optimization problem as:

[x^{*} = \underset{x \in X}{\arg\min}\, f(x) \tag{1}]

In this equation:

x represents the decision vector (intuitively: parameters such as the learning rate or network depth in deep learning)
X represents the decision space (intuitively: the set of possible values for the parameters, e.g., α = (0.01, 0.02, 0.03) for the learning rate)
f represents the objective function (e.g., R² or another performance metric of a machine learning model)

Many optimization problems in machine learning are black-box optimization problems[1], where the function is a black box. How can we achieve the optimization in equation (1) using Bayesian Optimization? Bayesian Optimization relies on two key components:

A surrogate model
An acquisition function

The Bayesian Optimization framework[2] alternates between these two components. A typical illustration applies it to the one-dimensional function f(x) = (x-0.3)² + 0.2sin(20x) for 3 iterations: at each step the surrogate model is refit to the observations collected so far, and the acquisition function proposes the next point to evaluate.
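To make the loop concrete, here is a minimal sketch of the framework applied to this function, assuming a GP surrogate from scikit-learn and a simple lower-confidence-bound rule as the acquisition function; the candidate grid, kernel length scale, and constants are illustrative choices, not prescribed by the cited papers:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# The one-dimensional objective above, treated as a black box
def f(x):
    return (x - 0.3) ** 2 + 0.2 * np.sin(20 * x)

X_grid = np.linspace(0, 1, 500).reshape(-1, 1)   # candidate points in the decision space X = [0, 1]

# A few initial observations
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, size=(3, 1))
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6)

for _ in range(3):                                # 3 iterations, as in the example
    gp.fit(X_obs, y_obs)                          # 1) fit the surrogate model to all observations
    mu, sigma = gp.predict(X_grid, return_std=True)
    acquisition = mu - 2.0 * sigma                # 2) lower confidence bound (we are minimizing)
    x_next = X_grid[np.argmin(acquisition)]       # 3) pick the most promising candidate
    X_obs = np.vstack([X_obs, x_next.reshape(1, -1)])
    y_obs = np.append(y_obs, f(x_next))           # 4) evaluate the true objective and repeat

print("best x found so far:", X_obs[np.argmin(y_obs)])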

Surrogate Models

  1. Gaussian Processes (GP)

As mentioned, machine learning often involves a black box where we only know inputs and outputs, making it difficult to determine the exact functional relationship[3]. Since the exact relationship is unknown, we can use a model to stand in for the objective function. This is the first component of Bayesian Optimization: the surrogate model (a probabilistic model that replaces the original, expensive-to-evaluate objective function).

This section focuses on Gaussian Processes (GP)[4].

Other surrogate models of interest can be found in the survey paper cited above (https://doi.org/10.13328/j.cnki.jos.005607).

A Gaussian Process is defined as:

[f(x) \sim GP(m(x), k(x, x'))]

where:

m(x) represents the mean function (for simplicity, we set m(x) = 0)
k represents the kernel (covariance) function

A common choice is the squared-exponential (RBF) kernel:

[k(x_i, x_j) = \operatorname{cov}(x_i, x_j) = \exp\left(-\frac{1}{2}\|x_i - x_j\|^2\right)]

Other kernel functions are also available. In Gaussian Processes, the kernel function determines the shape of the distribution and consequently the characteristics of the function we want to predict. For two points x_i and x_j that are close together, the kernel value approaches 1; for distant points, it approaches 0.

The kernel matrix can then be obtained as:

[K = \begin{bmatrix} k(x_1, x_1) & \ldots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \ldots & k(x_t, x_t) \end{bmatrix}]
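As a quick illustration (my own sketch, not from the cited sources), the squared-exponential kernel and the t×t matrix K can be computed directly with NumPy:

import numpy as np

def sq_exp_kernel(a, b):
    # k(x_i, x_j) = exp(-0.5 * ||x_i - x_j||^2), evaluated for all pairs of rows in a and b
    sq_dist = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * sq_dist)

x = np.linspace(-3, 3, 5).reshape(-1, 1)   # t = 5 one-dimensional input points
K = sq_exp_kernel(x, x)                    # the t x t kernel (covariance) matrix
print(K.round(3))                          # entries near 1 for close points, near 0 for distant ones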

For a regression task[4:1], a Gaussian Process defines a probability distribution over potential functions. Since this is a multivariate Gaussian distribution, these functions are also normally distributed. Typically, we assume μ = 0 before observing any training data. In the framework of Bayesian inference, this is called the prior distribution f(x) ~ N(μ_f, K_f). Without observing any training samples, this distribution spreads around μ_f = 0 (we can define this prior distribution as f(x) ~ N(0, K_f)). The dimension of the prior distribution matches the number of test points N = |X|. We'll use the kernel function to build the covariance matrix with dimensions N×N.

When training samples are added, the prior is updated to a posterior. Mathematically, suppose we observe sample points (x, y); then the joint Gaussian distribution of y and the prior f(x) is:

[\begin{bmatrix} f(x) \\ y \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{ff} & K_{fy} \\ K_{fy}^T & K_{yy} \end{bmatrix}\right)]

Then, we can derive the conditional distribution P(y|f) from the joint distribution:

[P(y|f) = N(\mu(x), \sigma^2(x))]

where:

μ(x) = K_{fy}^T K_{ff}^{-1} f(x)
σ²(x) = K_{yy} - K_{fy}^T K_{ff}^{-1} K_{fy}
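A minimal NumPy sketch of these two formulas, reusing the sq_exp_kernel helper from the kernel sketch above (the training points and the jitter term are illustrative assumptions):

import numpy as np

# Observed training points (x_f, f) and test points x_y where we want the posterior
x_f = np.array([[-2.0], [0.0], [1.5]])
f = np.sin(x_f).ravel()
x_y = np.linspace(-3, 3, 100).reshape(-1, 1)

K_ff = sq_exp_kernel(x_f, x_f) + 1e-8 * np.eye(len(x_f))   # small jitter for numerical stability
K_fy = sq_exp_kernel(x_f, x_y)
K_yy = sq_exp_kernel(x_y, x_y)

# Posterior mean and covariance, following the formulas above
mu = K_fy.T @ np.linalg.solve(K_ff, f)
cov = K_yy - K_fy.T @ np.linalg.solve(K_ff, K_fy)
sigma = np.sqrt(np.clip(np.diag(cov), 0, None))            # clip tiny negatives caused by round-off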

Understanding Gaussian Processes from a regression perspective: Suppose we want to fit the function:

[y = \sin(2.5x) + \sin(x) + 0.05x^2 + 1]

We generate input data by choosing a range of x values, which gives us the outputs y. Fitting a GP to this data (see the Gaussian Process Regression example in the Code Examples section below) produces a plot that is easy to interpret: for x < 10, where we have training data, the confidence interval is narrow; for x > 10, where there is no data, the confidence interval is much wider.

  2. TPE (Tree-structured Parzen Estimator)

In Gaussian Processes, we use p(y|x) to model the relationship between inputs and outputs. TPE instead models p(x|y) and p(y): the observed points are split by a threshold y* into a "good" density l(x) and a "bad" density g(x), each estimated with Parzen windows, and the next candidate is chosen to maximize the ratio l(x)/g(x).
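TPE is implemented, for example, in the hyperopt library as tpe.suggest; a minimal usage sketch follows (the toy objective and one-dimensional search space are invented for illustration; in practice the objective would be something like a cross-validation score):

from hyperopt import fmin, tpe, hp

# Toy objective to minimize
def objective(params):
    x = params['x']
    return (x - 0.3) ** 2

space = {'x': hp.uniform('x', -1.0, 1.0)}   # decision space for the single parameter x

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # e.g. {'x': 0.29...}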

Acquisition Functions

As described in the paper, the role of the acquisition function is to guide the search for the optimum.

Personally, I understand this as: In the previous section, we discussed how introducing new data points affects the joint distribution in GP. We could theoretically add n points to cover all x values, but this would defeat the purpose of Bayesian Optimization. The question is how to achieve x* = argmin f(x) with the fewest possible points.

Acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function.

We can understand an acquisition function as a rule for choosing the next point to evaluate. Common acquisition functions include:

  1. Probability of Improvement (PI)

PI attempts to maximize the probability of improving upon the current best value f(x+), where x+ = argmax f(x). The formula is:

[PI_{t}(x) = P(f_{t}(x) \geq f_{t}(x^+) + \xi) = \Phi\left(\frac{\mu_{t}(x) - f_{t}(x^+) - \xi}{\sigma_{t}(x)}\right)]

where Φ(·) is the standard normal cumulative distribution function, and ξ (ξ ≥ 0) is a trade-off parameter. The PI strategy selects the next hyperparameter combination by maximizing PI:

[x_{t+1} = \underset{x}{\arg\max}\, PI_{t}(x)]

where x_{t+1} represents the next hyperparameter combination.
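A small sketch of PI for a maximization problem (my own illustration: mu and sigma are assumed to be the posterior mean and standard deviation returned by a fitted surrogate, e.g. GaussianProcessRegressor.predict(..., return_std=True), and norm.cdf plays the role of Φ):

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    # PI_t(x) = Phi((mu_t(x) - f_t(x^+) - xi) / sigma_t(x))
    sigma = np.maximum(sigma, 1e-12)      # avoid division by zero where the posterior is certain
    z = (mu - f_best - xi) / sigma
    return norm.cdf(z)

# x_{t+1} = argmax_x PI_t(x) over a candidate grid X_grid:
# pi = probability_of_improvement(mu, sigma, f_best=y_obs.max())
# x_next = X_grid[np.argmax(pi)]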

  2. Expected Improvement (EI)

The PI strategy selects the candidate point with the highest probability of improvement, but it considers only the probability of improvement, not its magnitude. EI addresses this by defining:

[EI(x) = E[max(f_{t+1}(x) - f(x^+), 0)]]

Under the GP posterior, EI has the closed form:

[EI(x) = \begin{cases} (\mu(x) - f(x^+))\Phi(Z) + \sigma(x)\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}]

where Z = (μ(x) - f(x^+))/σ(x), Φ(·) is the standard normal CDF, and φ(·) is the standard normal PDF.

For the specific formula derivation, see the paper (page 13): http://arxiv.org/abs/1012.2599
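A matching sketch of EI under the same assumptions (mu and sigma from the surrogate's posterior; norm.cdf is Φ and norm.pdf is φ):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # EI(x) = (mu(x) - f(x^+)) * Phi(Z) + sigma(x) * phi(Z), and 0 wherever sigma(x) = 0
    safe_sigma = np.where(sigma > 0, sigma, 1.0)   # placeholder value to avoid division by zero
    z = (mu - f_best) / safe_sigma
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)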

  3. Confidence Bound Criteria

LCB (Lower Confidence Bound)

Used for minimizing the objective function:

[LCB(x) = \mu(x) - \kappa \sigma(x)]

UCB (Upper Confidence Bound)

Used for maximizing the objective function:

[UCB(x) = \mu(x) + \kappa \sigma(x)]

In LCB and UCB, κ is a user-chosen parameter that trades off exploration (large σ) against exploitation (good μ).
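Given the posterior mean and standard deviation, both bounds are one-liners (κ is the user-chosen constant mentioned above):

def lower_confidence_bound(mu, sigma, kappa=2.0):
    # LCB(x) = mu(x) - kappa * sigma(x); minimize this when minimizing the objective
    return mu - kappa * sigma

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # UCB(x) = mu(x) + kappa * sigma(x); maximize this when maximizing the objective
    return mu + kappa * sigma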

  4. GP-UCB

A simple acquisition strategy that selects the next hyperparameter combination by maximizing the upper confidence bound of the random variable:

[GP\text{-}UCB(x) = \mu(x) + \sqrt{v\tau_{t}}\,\sigma(x)]

  5. Others

See the paper: https://doi.org/10.13328/j.cnki.jos.005607

Summary

  1. Common surrogate models: Gaussian Processes (GP) and TPE
  2. Common acquisition functions: PI, EI, LCB/UCB, and GP-UCB (a usage sketch follows below)
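As a practical usage sketch (my own example, not from the cited survey): libraries such as scikit-optimize expose exactly these two choices, with gp_minimize using a GP surrogate and its acq_func argument selecting the acquisition function ("EI", "PI", or "LCB"):

import numpy as np
from skopt import gp_minimize

# Toy black-box objective; gp_minimize always minimizes
def objective(params):
    x = params[0]
    return (x - 0.3) ** 2 + 0.2 * np.sin(20 * x)

result = gp_minimize(
    objective,
    dimensions=[(0.0, 1.0)],   # decision space for the single parameter x
    acq_func="EI",             # acquisition function: "EI", "PI", or "LCB"
    n_calls=20,
    random_state=0,
)
print(result.x, result.fun)    # best parameter values and best objective value found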

Code Examples

Bayesian Optimization for SVR Parameters

from bayes_opt import BayesianOptimization
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, KFold

# x and y are assumed to be the full feature matrix (NumPy array) and target values (pandas Series), loaded beforehand
# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=100)

# Define cross-validation function for SVR
def svr_cv(C, epsilon, gamma):
    kf = KFold(n_splits=5, shuffle=True, random_state=100)
    svr = SVR(C=C, epsilon=epsilon, gamma=gamma)
    r2_scores = []
    
    for train_index, test_index in kf.split(X_train, Y_train):
        svr.fit(X_train[train_index], Y_train.values[train_index])
        pred = svr.predict(X_train[test_index])
        r2_scores.append(r2_score(Y_train.values[test_index], pred))  # r2_score expects (y_true, y_pred)
    
    return sum(r2_scores) / len(r2_scores)  # Return average R² score

# Initialize Bayesian Optimization
svr_bo = BayesianOptimization(
    svr_cv,
    {'C': (1, 16), 'epsilon': (0, 1), 'gamma': (0, 1)}
)

# Run optimization
svr_bo.maximize()

# Get best parameters
print(svr_bo.max)

# Output example:
# {'target': 0.9875895309185105,
#  'params': {'C': 14.595794386042416,
#   'epsilon': 0.09480102745231553,
#   'gamma': 0.09251046201638335}}

# Train model with best parameters
best_svr = SVR(
    C=svr_bo.max['params']['C'],
    epsilon=svr_bo.max['params']['epsilon'],
    gamma=svr_bo.max['params']['gamma']
)
best_svr.fit(X_train, Y_train)

# Evaluate on test set
test_r2 = r2_score(Y_test.values, best_svr.predict(X_test))
print(f"Test R² score: {test_r2}")
# Output example: 0.9945825852230629

Gaussian Process Regression Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Generate sample data (set the seed first so both X and the noise are reproducible)
np.random.seed(42)
n_samples = 100
x_min = -10
x_max = 10
X = np.sort(np.random.uniform(size=n_samples) * (x_max - x_min) + x_min)
X = X.reshape(-1, 1)

# Add noise to observations
noise = np.random.normal(loc=0.0, scale=0.5, size=n_samples)

# True function and noisy observations
y_clean = np.sin(X * 2.5) + np.sin(X * 1.0) + np.multiply(X, X) * 0.05 + 1
y_clean = y_clean.ravel()
y = y_clean + noise

# Define kernel and Gaussian Process model
kernel = RBF(
    length_scale=1, 
    length_scale_bounds=(1e-2, 1e3)
)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    alpha=0.1,
    n_restarts_optimizer=5,
    normalize_y=True
)

# Fit the model
gpr.fit(X, y)

# Generate predictions
x_test = np.linspace(x_min - 2.0, x_max + 7.5, n_samples * 2).reshape(-1, 1)
y_pred, y_pred_std = gpr.predict(x_test, return_std=True)

# Plot results
plt.figure(figsize=(15, 8))
plt.plot(x_test, y_pred, linewidth=3, label="GP mean")
plt.plot(X, y_clean, linewidth=3, label="Original y")
plt.plot(X, y, 'o', markersize=5, label="Noisy observations")
plt.fill_between(
    x_test.ravel(),
    y_pred - 1.96 * y_pred_std,
    y_pred + 1.96 * y_pred_std,
    label="95% confidence interval",  # ~1.96 standard deviations on each side
    interpolate=True,
    facecolor='blue',
    alpha=0.5
)
plt.xlim(x_min - 2, x_max + 7)
plt.legend()
plt.title("Gaussian Process Regression")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

Recommendations

Gaussian Processes for Machine Learning: https://gaussianprocess.org/gpml/chapters/RW.pdf
Bayesian Optimization Paper: http://arxiv.org/abs/1012.2599
Bayesian Optimization Blog: https://banxian-w.com/article/2023/3/27/2539.html
Visualizing Gaussian Processes: https://jgoertler.com/visual-exploration-gaussian-processes/#MargCond

References

http://krasserm.github.io/2018/03/21/bayesian-optimization/
https://zhuanlan.zhihu.com/p/53826787
Cui, J., & Yang, B. (2018). A survey on Bayesian optimization methods and applications. Journal of Software, 29(10), 3068-3090. https://doi.org/10.13328/j.cnki.jos.005607
https://jgoertler.com/visual-exploration-gaussian-processes/
http://arxiv.org/abs/1012.2599
https://www.cvmart.net/community/detail/3502
https://gaussianprocess.org/gpml/chapters/RW.pdf
