Visualizing PCA in R with ggbiplot: sample scatter, group ellipses, and loading vectors
Overview
ggbiplot is a small R package that renders PCA results with ggplot2. It works directly with prcomp() or princomp() outputs and can:
- Plot samples in the PCA space
- Color by groups with optional Normal ellipses
- Display variable loading vectors and an optional correlation circle
- Produce scree plots of explained variance
The package provides two main functions: ggbiplot() for biplots and ggscreeplot() for variance diagnostics.
Installation and quick start
# Install from GitHub (pick one approach)
# install.packages("devtools"); devtools::install_github("vqv/ggbiplot")
# or
# install.packages("remotes"); remotes::install_github("vqv/ggbiplot")
library(ggbiplot)
library(ggplot2)
# Demo data bundled with ggbiplot
data(wine) # loads 'wine' and 'wine.class'
# Fit PCA (scale to unit variance so loadings are comparable)
wine_fit <- prcomp(wine, scale. = TRUE)
# Scree plot of explained variance
# (useful to decide how many PCs to inspect)
ggscreeplot(wine_fit, type = "bar", ncomp = 6)
# Biplot with groups, 68% Normal ellipses, and the correlation circle
p <- ggbiplot(
wine_fit,
obs.scale = 1,
var.scale = 1,
groups = wine.class,
ellipse = TRUE,
circle = TRUE,
alpha = 0.8,
varname.abbrev = TRUE
) +
scale_color_discrete(name = NULL) +
theme(legend.position = "top", legend.direction = "horizontal")
p
# For comparison: a very plain base R scatter of the first two PCs
plot(wine_fit$x[, 1:2], pch = 19, col = as.integer(wine.class))
How to read the biplot
- PC1 and PC2 axes show the fraction of total variance explained by each component
- Points are samples; color encodes group membership
- Ellipses summarize the group distribution under a Normal asssumption (by default, ~68% probability mass)
- Arrows are variable loadings; arrow length indicates contribution magnitude and direction indicates correlation with the PCs
Working with microbiome OTU tables
The following example demonstrates PCA on an OTU count matrix with a sample-level design file. Replace file paths with your own.
library(ggbiplot)
library(ggplot2)
# Read study design (rows are samples)
meta <- read.table("design.txt", header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)
# Read OTU/ASV table (rows are taxa/OTUs, columns are samples)
otu_raw <- read.delim("otu_table.txt", header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)
# Align samples present in both tables and preserve a common order
common_ids <- intersect(rownames(meta), colnames(otu_raw))
meta_sub <- meta[common_ids, , drop = FALSE]
counts <- otu_raw[, common_ids, drop = FALSE]
# PCA on samples: transpose so samples are rows
# Scale = TRUE treats all OTUs on a comparable scale (useful when counts differ widely)
ota_fit <- prcomp(t(counts), scale. = TRUE)
# Group by a design factor (e.g., genotype)
# Replace 'genotype' with the column you want to use
b <- ggbiplot(
ota_fit,
groups = meta_sub$genotype,
ellipse = TRUE,
obs.scale = 1,
var.scale = 1,
var.axes = FALSE,
alpha = 0.7
) +
scale_color_brewer(palette = "Set1", name = "Group") +
theme_minimal() +
theme(legend.position = "top")
b
If groups are well separated along PC1/PC2, that indicates strong between-group structure in the data.
Focusing on high-variance taxa and showing correlations
To highlight taxa that drive separation, subset to the most variable features and re-run the PCA with loading vectors and the correlation circle.
# Convert counts to column-wise percentages (relative abundance)
rel_abund <- sweep(counts, 2, colSums(counts, na.rm = TRUE), FUN = "/") * 100
# Option A: keep taxa with MAD > 0.5
mad_vals <- apply(rel_abund, 1, mad, na.rm = TRUE)
high_var_taxa <- rel_abund[mad_vals > 0.5, , drop = FALSE]
# Option B: take the top N most variable taxa by MAD
N <- 6
ord <- order(mad_vals, decreasing = TRUE)
high_var_topN <- rel_abund[ord[seq_len(min(N, length(ord)))], , drop = FALSE]
# Choose one of the subsets to proceed (here, top N)
core_mat <- high_var_topN
# PCA on the subset (transpose so samples are rows)
core_fit <- prcomp(t(core_mat), scale. = TRUE)
# Biplot with loading vectors and correlation circle enabled
c <- ggbiplot(
core_fit,
groups = meta_sub$genotype,
ellipse = TRUE,
var.axes = TRUE,
circle = TRUE,
obs.scale = 1,
var.scale = 1,
varname.size = 3,
varname.abbrev = FALSE
) +
scale_color_brewer(palette = "Dark2", name = "Group") +
theme_minimal() +
theme(legend.position = "top")
c
Arrows pointing in similar directions indicate taxa positively correlated with each other and with that PC; arrows pointing in opposite directions suggest negative correlation. Longer arrows denote stronger contributions to the separation captured by the plotted PCs.
ggbiplot function arguments (quick reference)
- pcobj: output of prcomp() or princomp()
- choices: which PCs to plot (default c(1, 2))
- scale: 1 for covariance biplot, 0 for form biplot; affects how observations and variables are scaled
- obs.scale: scaling applied to the observation scores (samples)
- var.scale: scaling applied to variable loadings (arrows)
- pc.biplot: match biplot.princomp() conventions if TRUE
- groups: factor (or vector coercible to factor) for coloring samples
- ellipse: draw Normal probability ellipses per group when TRUE
- ellipse.prob: probability mass for ellipses (default 0.68)
- labels: optional character vector for sample labels
- labels.size: size of sample labels
- alpha: point transparency for samples (0 transparent, 1 opaque)
- var.axes: draw variable loading vectors (arrows) when TRUE
- circle: draw the unit correlation circle; meaningful when prcomp(scale.=TRUE) and var.scale = 1
- circle.prob: probability used to compute the circle radius (default ≈ 0.69)
- varname.size: text size for variable names
- varname.adjust: nudges labels away from arrow tips (≥ 1 pushes labels farther)
- varname.abbrev: abbreviate varible names when TRUE
- ...: additional ggplot2 layer arguments (e.g., color scales, themes)