# Bayesian Optimization: Surrogate Models, Acquisition Functions, and Practical Implementation
## Problem Setting
Consider the task of minimizing (or, symmetrically, maximizing) an expensive black-box function \(f: \mathcal{X}\to\mathbb{R}\) whose analytic form is unknown and for which only point-wise evaluations are available. In machine-learning terms, this is the hyper-parameter tuning problem: given a validation metric such as \(R^2(\theta)\), find \(\theta^*\in\Theta\) that maximizes it while keeping the number of full training runs small.
## Core Loop
Bayesian optimization alternates between two operations, sketched in the loop below:
- Surrogate model: Fit a probabilistic regression model on the collected \(\{(x_i, y_i)\}\) pairs.
- Acquisition function: Use the posterior of the surrogate to decide the next query point.
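A self-contained sketch of this loop, assuming a GP surrogate and the expected-improvement criterion defined formally later in this article; for simplicity the acquisition is maximized over a random candidate pool rather than with a gradient-based optimizer:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt_loop(f, bounds, n_init=5, n_iter=20, seed=0):
    """Minimal BO loop (maximization): GP surrogate + expected improvement."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, (n_init, 1))           # initial design
    y = np.array([f(x[0]) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)  # surrogate
        cand = rng.uniform(lo, hi, (1000, 1))      # cheap candidate pool
        mu, sd = gp.predict(cand, return_std=True)
        z = (mu - y.max()) / np.maximum(sd, 1e-12)
        ei = sd * (z * norm.cdf(z) + norm.pdf(z))  # expected improvement
        x_next = cand[np.argmax(ei)]               # acquisition maximizer
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))             # one expensive evaluation
    return X[np.argmax(y)], y.max()

# Toy usage: maximize a quadratic with its peak at x = 1.
x_best, y_best = bayes_opt_loop(lambda x: -(x - 1.0) ** 2, (-2.0, 2.0))
```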
## Surrogate Models
### Gaussian Process Regression
A Gaussian process (GP) places a prior over functions:
\[ f(\cdot)\sim\mathcal{GP}\bigl(m(\cdot),k(\cdot,\cdot)\bigr) \] For simplicity we set the mean function \(m(x)=0\). The kernel \(k\) encodes smoothness; the squared-exponential (RBF) kernel is common:
\[ k(x_i,x_j)=\sigma^2\exp\!\left(-\frac{\|x_i-x_j\|^2}{2\ell^2}\right) \] Given noisy observations \(y_i=f(x_i)+\varepsilon_i\) with \(\varepsilon_i\sim\mathcal{N}(0,\sigma_n^2)\), the posterior at a test location \(x_*\) is
\[ \begin{aligned} \mu(x_*)&=\mathbf{k}_*^T(K+\sigma_n^2I)^{-1}\mathbf{y},\\ \sigma^2(x_*)&=k(x_*,x_*)-\mathbf{k}_*^T(K+\sigma_n^2I)^{-1}\mathbf{k}_*, \end{aligned} \] where \(K_{ij}=k(x_i,x_j)\) and \(\mathbf{k}_*=[k(x_*,x_1),\dots,k(x_*,x_n)]^T\).
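These two formulas translate directly into a few lines of NumPy; a minimal sketch using the RBF kernel from above and a Cholesky factorization in place of the explicit inverse (here `noise` plays the role of \(\sigma_n^2\)):

```python
import numpy as np

def rbf(A, B, sigma=1.0, ell=1.0):
    # Squared-exponential kernel k(x_i, x_j) defined above.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma**2 * np.exp(-d2 / (2 * ell**2))

def gp_posterior(X, y, X_star, noise=1e-2):
    """Posterior mean and variance at test points X_star (shape (m, d))."""
    K = rbf(X, X) + noise * np.eye(len(X))   # K + sigma_n^2 I
    k_star = rbf(X, X_star)                  # column i is k(x_i, x_*)
    L = np.linalg.cholesky(K)                # stable alternative to inv()
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = k_star.T @ alpha                    # posterior mean
    v = np.linalg.solve(L, k_star)
    var = rbf(X_star, X_star).diagonal() - (v**2).sum(0)  # posterior variance
    return mu, var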
### Tree-structured Parzen Estimator (TPE)
Instead of modeling \(p(y\mid x)\) directly, TPE splits the observation history at a threshold \(y^*\) (e.g., a fixed quantile of the observed objective values) and models two densities over the input space:
\[ p(x\mid y)\approx \begin{cases} \ell(x) & \text{if } y < y^*,\\ g(x) & \text{if } y \ge y^*. \end{cases} \]
The next query maximizes the ratio \(\ell(x)/g(x)\), which is monotonically related to expected improvement; because the densities are estimated directly over the search space, TPE handles categorical variables and conditional (tree-structured) spaces naturally.
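A toy one-dimensional version using kernel density estimates makes the idea concrete. This is a sketch, not the full tree-structured algorithm: it assumes a minimization objective, a continuous scalar input, and enough observations on both sides of the split for the KDEs to be well defined; `gamma` is the quantile that defines "good" points.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(X, y, gamma=0.25, n_candidates=1000, rng=None):
    """Propose the next query point by maximizing l(x)/g(x)."""
    rng = rng or np.random.default_rng(0)
    y_star = np.quantile(y, gamma)             # split threshold y*
    good, bad = X[y < y_star], X[y >= y_star]  # "good" vs "bad" observations
    l, g = gaussian_kde(good), gaussian_kde(bad)
    cand = rng.uniform(X.min(), X.max(), n_candidates)
    return cand[np.argmax(l(cand) / (g(cand) + 1e-12))]
```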
## Acquisition Functions
### Probability of Improvement (PI)
\[ \alpha_{\text{PI}}(x)=\Phi\!\left(\frac{\mu(x)-f^+-\xi}{\sigma(x)}\right),\quad \xi\ge 0, \]
where \(f^+\) is the best objective value observed so far and \(\xi\) trades exploitation against exploration.
### Expected Improvement (EI)
\[ \alpha_{\text{EI}}(x)=\sigma(x)\!\left[Z\Phi(Z)+\phi(Z)\right],\quad Z=\frac{\mu(x)-f^+}{\sigma(x)}, \]
where \(\phi,\Phi\) are the standard normal pdf and cdf.
### Upper/Lower Confidence Bound
\[ \alpha_{\text{UCB}}(x)=\mu(x)+\beta^{1/2}\sigma(x),\qquad \alpha_{\text{LCB}}(x)=\mu(x)-\beta^{1/2}\sigma(x) \]
For GPs over a finite decision set \(D\), the schedule \(\beta_t=2\log(|D|t^2\pi^2/6\delta)\) yields the GP-UCB regret bound of Srinivas et al.
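Given the posterior \(\mu(x)\) and \(\sigma(x)\), each acquisition function is a direct transcription of its formula (maximization convention; `f_best` stands for \(f^+\)):

```python
import numpy as np
from scipy.stats import norm

def prob_improvement(mu, sigma, f_best, xi=0.01):
    # PI: probability the posterior exceeds f_best by at least xi.
    return norm.cdf((mu - f_best - xi) / np.maximum(sigma, 1e-12))

def expected_improvement(mu, sigma, f_best):
    # EI: sigma * [Z * Phi(Z) + phi(Z)] with Z = (mu - f_best) / sigma.
    z = (mu - f_best) / np.maximum(sigma, 1e-12)
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

def ucb(mu, sigma, beta=2.0):
    # UCB: optimistic bound mu + sqrt(beta) * sigma.
    return mu + np.sqrt(beta) * sigma
```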
## End-to-End Example: SVR Hyper-parameter Search
```python
from bayes_opt import BayesianOptimization
from sklearn.svm import SVR
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import r2_score
from sklearn.datasets import load_diabetes

# load_boston was removed in scikit-learn 1.2; the diabetes regression
# dataset serves as a small drop-in replacement here.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

def cv_score(C, epsilon, gamma):
    """5-fold cross-validated R^2 for one hyper-parameter setting."""
    folds = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = []
    for tr, va in folds.split(X_train):
        model = SVR(C=C, epsilon=epsilon, gamma=gamma)
        model.fit(X_train[tr], y_train[tr])
        scores.append(r2_score(y_train[va], model.predict(X_train[va])))
    return sum(scores) / len(scores)

# gamma must be strictly positive for SVR, so bound it away from zero.
pbounds = {'C': (1, 16), 'epsilon': (0, 1), 'gamma': (0.001, 1)}

opt = BayesianOptimization(f=cv_score, pbounds=pbounds, random_state=42)
opt.maximize(init_points=5, n_iter=25)

best = opt.max
print(best)
# e.g. {'target': ..., 'params': {'C': ..., 'epsilon': ..., 'gamma': ...}}
# (exact values depend on the dataset and random seed)

svr_best = SVR(**best['params'])
svr_best.fit(X_train, y_train)
print("Test R²:", r2_score(y_test, svr_best.predict(X_test)))
```
## Visualizing the GP Surrogate
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Noisy observations of sin(x) on [-5, 5].
rng = np.random.RandomState(0)
X_obs = rng.uniform(-5, 5, 20).reshape(-1, 1)
y_obs = np.sin(X_obs).ravel() + 0.2 * rng.randn(20)

# alpha is the assumed noise variance sigma_n^2 added to the kernel diagonal.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=0.1**2, normalize_y=True)
gp.fit(X_obs, y_obs)

X_plot = np.linspace(-6, 6, 400).reshape(-1, 1)
mu, std = gp.predict(X_plot, return_std=True)

plt.figure(figsize=(8, 4))
plt.scatter(X_obs, y_obs, c='k', zorder=10, label='Observations')
plt.plot(X_plot, mu, 'b', label='Posterior mean')
plt.fill_between(X_plot.ravel(), mu - 1.96 * std, mu + 1.96 * std,
                 color='blue', alpha=0.2, label='95% CI')
plt.legend(); plt.show()
```
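To see which point the optimizer would query next, we can evaluate EI on the plotting grid. This snippet continues the script above and reuses the `expected_improvement` helper sketched in the acquisition section; \(f^+\) is taken as the best observation so far.

```python
# Continues the visualization script; expected_improvement is defined above.
ei = expected_improvement(mu, std, y_obs.max())  # EI at each grid point
x_next = X_plot[np.argmax(ei), 0]                # acquisition maximizer
print(f"Proposed next query: x = {x_next:.3f}")
```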
## Experimental Design in the Sciences
Bayesian optimization is equally useful when the objective is a wet-lab measurement. A minimal template (here a pre-fitted regression `model` and `scaler` stand in for the real assay):
```python
import numpy as np
from bayes_opt import BayesianOptimization

def black_box(a, b, c, d, e):
    # `scaler` and `model` are assumed to be pre-fitted objects proxying
    # the real assay; in practice this function would run (or queue) an
    # actual experiment and return the measured response.
    x = np.array([[a, b, c, d, e]])
    x_scaled = scaler.transform(x)
    return model.predict(x_scaled)[0]

optimizer = BayesianOptimization(
    f=black_box,
    pbounds={'a': (0, 1), 'b': (0, 1), 'c': (0, 1),
             'd': (0, 1), 'e': (0, 1)},
    random_state=7,
)
optimizer.maximize(init_points=30, n_iter=50)
print("Best protocol:", optimizer.max)
```
Replace `black_box` with any expensive assay, and the loop will propose the next experimental conditions to run.