
GB-Score: A Gradient Boosted Tree Scoring Function for Protein–Ligand Binding Affinity Prediction

Tools May 13 2

Motivation and Background

Molecular docking remains a cornerstone of structure-based drug design, consisting of pose generation and subsequent evaluation via scoring functions. Advances in data availability and computational resources have driven the adoption of machine learning (ML) to build more accurate and robust scoring functions. GB-Score is a state-of-the-art ML-based scoring function that leverages gradient boosted trees and distance-weighted atom contact features to directly predict binding affinities.

GB-Score uses the refined general set of PDBbind v2019 (23 496 complexes) for training. The method represents each protein–ligand complex as a vector of distance-weighted interatomic contacts, and the tree ensemble model achieves a Pearson correlation of 0.862 and RMSE of 1.190 on the CASF-2016 benchmark.

The freely available implementation and relevant publications can be found here:

Feature Representation and Model Design

Atom Typing and Distance Encoding

Ligand atoms are classified purely by element (H, C, N, O, F, P, S, Cl, Br, I). For proteins, residues are first grouped by side-chain chemical character: charged (c), polar (p), amphipathic (a), and hydrophobic (h). Within each group, atoms receive an elemental label. This stratification captures the local chemical environment.

For each pair of protein and ligand atom types, interatomic distances are computed. Distances below a cutoff \(d_{\text{cutoff}} = 12\,\text{Å}\) are weighted by the inverse power of the distance with \(n = 2\), that is \(1/d^{2}\), and summed. Performing this operation across all type pairs yields a 400-dimensional feature vector representing the complex.
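The distance-weighted summation can be sketched in a few lines of NumPy. This is a minimal illustration, not the GB-Score implementation: the function name `contact_features` and the short type lists are hypothetical, and the real scheme covers all 400 type pairs rather than the 9 used here.

```python
import numpy as np

# Hypothetical, shortened type lists for illustration only.
LIGAND_TYPES = ["C", "N", "O"]
PROTEIN_TYPES = ["h_C", "p_O", "c_N"]  # residue group prefix + element


def contact_features(lig_xyz, lig_types, prot_xyz, prot_types,
                     cutoff=12.0, power=2):
    """Sum 1 / d**power over protein-ligand atom pairs with d < cutoff,
    accumulated separately for each (ligand type, protein type) pair."""
    lig_xyz = np.asarray(lig_xyz, dtype=float)
    prot_xyz = np.asarray(prot_xyz, dtype=float)
    lig_types = np.asarray(lig_types)
    prot_types = np.asarray(prot_types)

    # Pairwise distances, shape (n_ligand_atoms, n_protein_atoms).
    dist = np.linalg.norm(lig_xyz[:, None, :] - prot_xyz[None, :, :], axis=-1)

    # Weights are zero at or beyond the cutoff (and for coincident atoms).
    mask = (dist > 0.0) & (dist < cutoff)
    weights = np.zeros_like(dist)
    weights[mask] = dist[mask] ** (-power)

    features = {}
    for lt in LIGAND_TYPES:
        for pt in PROTEIN_TYPES:
            block = weights[np.ix_(lig_types == lt, prot_types == pt)]
            features[f"{lt}-{pt}"] = float(block.sum())
    return features


# Two ligand atoms, two protein atoms; the second protein atom is out of range.
feats = contact_features(
    lig_xyz=[[0, 0, 0], [1, 0, 0]], lig_types=["C", "O"],
    prot_xyz=[[0, 0, 2], [0, 0, 20]], prot_types=["h_C", "p_O"],
)
print(feats["C-h_C"])  # 0.25, since d = 2 Å gives 1 / 2**2
```

Type pairs with no atoms in range simply contribute zero, which is why many columns end up static or nearly static and are later filtered out.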

Preprocessing and Feature Selection

Redundant and low-variance features are removed during preprocessing:

Static and quasi-static features (variance < 0.01) are discarded.
Highly correlated features (correlation > 0.95) are eliminated, reducing dimensionality in a dataset-dependent manner.

Remaining features are standardized (zero mean, unit variance).
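The three preprocessing steps can be approximated with pandas as follows. This is a sketch under stated assumptions: `select_and_scale` is a hypothetical helper, the correlation filter uses the absolute Pearson correlation, and for each correlated pair the later column is the one dropped. The original pipeline may make different choices on these points.

```python
import numpy as np
import pandas as pd


def select_and_scale(features, var_threshold=0.01, corr_threshold=0.95):
    """Drop near-constant and highly correlated columns, then standardize."""
    # 1. Discard static and quasi-static features.
    kept = features.loc[:, features.var() >= var_threshold]

    # 2. Inspect the upper triangle of the correlation matrix and drop the
    #    later column of every pair above the threshold (assumed tie rule).
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns
               if (upper[col] > corr_threshold).any()]
    kept = kept.drop(columns=to_drop)

    # 3. Standardize to zero mean and unit variance.
    return (kept - kept.mean()) / kept.std(ddof=0)


# Synthetic demo table, not real GB-Score features.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 1e-3 * rng.normal(size=200),  # nearly collinear with "a"
    "constant": np.full(200, 3.0),                     # zero variance
    "b": rng.normal(size=200),
})
scaled = select_and_scale(demo)
print(list(scaled.columns))  # ['a', 'b']
```

When scoring new complexes, the column selection and the mean and standard deviation learned on the training set would need to be reused rather than recomputed.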

Training Strategy and Hyperparameters

Three ensemble algorithms—Random Forest (RF), Extremely Randomized Trees (ERT), and Gradient Boosted Trees (GBT)—were evaluated using scikit-learn. RF and ERT used 500 estimators and tuned only the max_features parameter. GBT hyperparameters were adopted verbatim from the reference paper.

To mitigate the effect of random initialization, each model was trained ten times independently, and the reported RMSE and Pearson correlation are averages over those ten runs.
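The repeated-training protocol might look like the sketch below. The data are synthetic, and the hyperparameters (`n_estimators=100`, `subsample=0.8`) are placeholders chosen for illustration, not the values from the reference paper. Subsampling is enabled here only so that the random seed visibly affects the fitted model.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and binding affinities.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rmses, pearsons = [], []
for seed in range(10):  # ten independent runs, one seed each
    model = GradientBoostingRegressor(
        n_estimators=100, subsample=0.8, random_state=seed
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmses.append(float(np.sqrt(mean_squared_error(y_test, pred))))
    pearsons.append(float(pearsonr(y_test, pred)[0]))

# Report the mean with the standard deviation across runs in parentheses.
print(f"RMSE = {np.mean(rmses):.3f} ({np.std(rmses):.3f})")
print(f"R_p  = {np.mean(pearsons):.3f} ({np.std(pearsons):.3f})")
```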

Performance Characteristics

GBT outperformed the other two algorithms when assessed on CASF-2016 core sets. Training on the entire PDBbind v2019 refined set produced the final GB-Score model (\(R_p = 0.862\), RMSE = 1.190).

A notable limitation appears for complexes with experimental \(pK_i/pK_d > 10\), where GB-Score produces larger errors. Only 1.80% of the training data fall in this range, causing the model to be biased towards mid-range affinities. Future efforts can address this by augmenting the training set with high-affinity examples.

Five-fold cross-validation across the full PDBbind v2019 dataset yielded \(R_p = 0.764\) (0.001) and RMSE = 1.205 (0.007). The performance drop reflects increased data diversity and volume. Systematic experiments adjusting CASF-2016 training set size and controlling data similarity confirm the robustness of the GBT algorithm even when similarity between training and test data decreases.
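A five-fold cross-validation loop of this kind can be sketched with scikit-learn's `KFold`. The data are again synthetic and the model uses library defaults, so the numbers it prints bear no relation to the figures quoted above. Whether the parenthesized values in the text are spreads over folds or over repeated runs is not stated in the source; the sketch reports the spread over folds.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in data.
X, y = make_regression(n_samples=250, n_features=15, noise=10.0, random_state=1)

fold_rp, fold_rmse, fold_sizes = [], [], []
splitter = KFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in splitter.split(X):
    # Fit on four folds, evaluate on the held-out fold.
    model = GradientBoostingRegressor(random_state=1)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_rp.append(float(pearsonr(y[test_idx], pred)[0]))
    fold_rmse.append(float(np.sqrt(np.mean((y[test_idx] - pred) ** 2))))
    fold_sizes.append(len(test_idx))

print(f"R_p  = {np.mean(fold_rp):.3f} ({np.std(fold_rp):.3f})")
print(f"RMSE = {np.mean(fold_rmse):.3f} ({np.std(fold_rmse):.3f})")
```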

When the CASF-2016 core set is split into 57 protein families, 75% of these families show predictions with a correlation coefficient above 0.7, considered acceptable for virtual screening.
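The per-family breakdown amounts to computing one Pearson correlation per group and counting how many exceed 0.7. The sketch below uses a made-up table with three families; the column names and values are hypothetical and only illustrate the bookkeeping.

```python
import pandas as pd

# Hypothetical toy results: three families with five complexes each.
results = pd.DataFrame({
    "family": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "experimental": [4.0, 5.0, 6.0, 7.0, 8.0] * 3,
    "predicted": [
        4.2, 5.1, 5.8, 7.3, 7.9,   # family A: tracks experiment closely
        6.0, 5.0, 7.0, 5.5, 6.5,   # family B: weak agreement
        5.0, 5.5, 6.5, 6.8, 8.1,   # family C: tracks experiment closely
    ],
})

# One Pearson correlation per protein family.
per_family = pd.Series({
    name: group["experimental"].corr(group["predicted"])
    for name, group in results.groupby("family")
})

# Fraction of families whose correlation exceeds the 0.7 threshold.
fraction_ok = float((per_family > 0.7).mean())
print(per_family.round(3).to_dict())
print(f"{fraction_ok:.0%} of families have R_p > 0.7")
```

In this toy table, families A and C pass the threshold and family B (correlation 0.3) does not, so two of three families count as acceptable.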

Among comparable scoring functions, GB-Score delivers competitive results:

ECIF::LD-GBT: \(R_p = 0.866\)
GB-Score: \(R_p = 0.862\)

Other functions evaluated include ECIF, AGL-Score, ETScore, EIC-Score, RosENet, KDEEP, PLEC-nn, OnionNet, ΔvinaRF20, RI-Score, and X-score.

Environment Setup and Validation

Virtual Environment Creation

conda create -n gb_score_env python=3.8.8 numpy=1.21.2 pandas=1.2.4 seaborn=0.11.1 joblib=1.0.1 matplotlib=3.3.4
conda activate gb_score_env
python -m pip install biopandas==0.2.8 scipy==1.7.1 scikit-learn==0.24.1 progressbar2==3.53.1
conda install jupyter

Input Preparation

Place each complex’s ligand (.mol2) and protein (.pdb) files in a dedicated folder. For example:

./1a1e/1a1e_ligand.mol2
./1a1e/1a1e_protein.pdb
./1a4k/1a4k_ligand.mol2
./1a4k/1a4k_protein.pdb
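Before generating features, it can help to confirm that every complex folder follows this layout. The helper below is a hypothetical convenience script, not part of GB-Score; it assumes each folder is named after its complex ID and contains `<id>_ligand.mol2` and `<id>_protein.pdb`, as in the example above.

```python
import tempfile
from pathlib import Path


def find_incomplete_complexes(root):
    """Return the names of complex folders missing a ligand or protein file."""
    incomplete = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        pdb_id = folder.name
        ligand = folder / f"{pdb_id}_ligand.mol2"
        protein = folder / f"{pdb_id}_protein.pdb"
        if not (ligand.is_file() and protein.is_file()):
            incomplete.append(pdb_id)
    return incomplete


# Demo on a throwaway directory: 1a1e is complete, 1a4k lacks its protein file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "1a1e").mkdir()
    (root / "1a1e" / "1a1e_ligand.mol2").touch()
    (root / "1a1e" / "1a1e_protein.pdb").touch()
    (root / "1a4k").mkdir()
    (root / "1a4k" / "1a4k_ligand.mol2").touch()
    missing = find_incomplete_complexes(root)

print(missing)  # ['1a4k']
```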

Feature Generation

Run generate_features.py to compute the 400-dimensional feature vector and export a CSV file. The -d flag specifies the input directory; -f names the output file.

python generate_features.py -d score/score_in/ -f feature.csv

Reproducing and Extending Results

analysis.ipynb contains detailed steps to replicate the entire design and validation pipeline. Required pre-computed files (.csv and .joblib) must be downloaded and placed inside the files and saved_model directories before running the notebook.

GB-Score demonstrates that carefully engineered distance-weighted atom contacts combined with gradient boosted trees can achieve binding affinity predictions that rival or approach the top-performing methods on community benchmarks.

