Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Visualizing PCA in R with ggbiplot: sample scatter, group ellipses, and loading vectors

Tech 1

Overview

ggbiplot is a small R package that renders PCA results with ggplot2. It works directly with prcomp() or princomp() outputs and can:

  • Plot samples in the PCA space
  • Color by groups with optional Normal ellipses
  • Display variable loading vectors and an optional correlation circle
  • Produce scree plots of explained variance

The package provides two main functions: ggbiplot() for biplots and ggscreeplot() for variance diagnostics.

Installation and quick start

# Install from GitHub (pick one approach)
# install.packages("devtools"); devtools::install_github("vqv/ggbiplot")
# or
# install.packages("remotes"); remotes::install_github("vqv/ggbiplot")

library(ggbiplot)
library(ggplot2)

# Demo data bundled with ggbiplot
data(wine)              # loads 'wine' and 'wine.class'

# Fit PCA (scale to unit variance so loadings are comparable)
wine_fit <- prcomp(wine, scale. = TRUE)

# Scree plot of explained variance
# (useful to decide how many PCs to inspect)
ggscreeplot(wine_fit, type = "bar", ncomp = 6)

# Biplot with groups, 68% Normal ellipses, and the correlation circle
p <- ggbiplot(
  wine_fit,
  obs.scale  = 1,
  var.scale  = 1,
  groups     = wine.class,
  ellipse    = TRUE,
  circle     = TRUE,
  alpha      = 0.8,
  varname.abbrev = TRUE
) +
  scale_color_discrete(name = NULL) +
  theme(legend.position = "top", legend.direction = "horizontal")

p

# For comparison: a very plain base R scatter of the first two PCs
plot(wine_fit$x[, 1:2], pch = 19, col = as.integer(wine.class))

How to read the biplot

  • PC1 and PC2 axes show the fraction of total variance explained by each component
  • Points are samples; color encodes group membership
  • Ellipses summarize the group distribution under a Normal asssumption (by default, ~68% probability mass)
  • Arrows are variable loadings; arrow length indicates contribution magnitude and direction indicates correlation with the PCs

Working with microbiome OTU tables

The following example demonstrates PCA on an OTU count matrix with a sample-level design file. Replace file paths with your own.

library(ggbiplot)
library(ggplot2)

# Read study design (rows are samples)
meta <- read.table("design.txt", header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Read OTU/ASV table (rows are taxa/OTUs, columns are samples)
otu_raw <- read.delim("otu_table.txt", header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

# Align samples present in both tables and preserve a common order
common_ids <- intersect(rownames(meta), colnames(otu_raw))
meta_sub   <- meta[common_ids, , drop = FALSE]
counts     <- otu_raw[, common_ids, drop = FALSE]

# PCA on samples: transpose so samples are rows
# Scale = TRUE treats all OTUs on a comparable scale (useful when counts differ widely)
ota_fit <- prcomp(t(counts), scale. = TRUE)

# Group by a design factor (e.g., genotype)
# Replace 'genotype' with the column you want to use
b <- ggbiplot(
  ota_fit,
  groups     = meta_sub$genotype,
  ellipse    = TRUE,
  obs.scale  = 1,
  var.scale  = 1,
  var.axes   = FALSE,
  alpha      = 0.7
) +
  scale_color_brewer(palette = "Set1", name = "Group") +
  theme_minimal() +
  theme(legend.position = "top")

b

If groups are well separated along PC1/PC2, that indicates strong between-group structure in the data.

Focusing on high-variance taxa and showing correlations

To highlight taxa that drive separation, subset to the most variable features and re-run the PCA with loading vectors and the correlation circle.

# Convert counts to column-wise percentages (relative abundance)
rel_abund <- sweep(counts, 2, colSums(counts, na.rm = TRUE), FUN = "/") * 100

# Option A: keep taxa with MAD > 0.5
mad_vals <- apply(rel_abund, 1, mad, na.rm = TRUE)
high_var_taxa <- rel_abund[mad_vals > 0.5, , drop = FALSE]

# Option B: take the top N most variable taxa by MAD
N <- 6
ord <- order(mad_vals, decreasing = TRUE)
high_var_topN <- rel_abund[ord[seq_len(min(N, length(ord)))], , drop = FALSE]

# Choose one of the subsets to proceed (here, top N)
core_mat <- high_var_topN

# PCA on the subset (transpose so samples are rows)
core_fit <- prcomp(t(core_mat), scale. = TRUE)

# Biplot with loading vectors and correlation circle enabled
c <- ggbiplot(
  core_fit,
  groups        = meta_sub$genotype,
  ellipse       = TRUE,
  var.axes      = TRUE,
  circle        = TRUE,
  obs.scale     = 1,
  var.scale     = 1,
  varname.size  = 3,
  varname.abbrev = FALSE
) +
  scale_color_brewer(palette = "Dark2", name = "Group") +
  theme_minimal() +
  theme(legend.position = "top")

c

Arrows pointing in similar directions indicate taxa positively correlated with each other and with that PC; arrows pointing in opposite directions suggest negative correlation. Longer arrows denote stronger contributions to the separation captured by the plotted PCs.

ggbiplot function arguments (quick reference)

  • pcobj: output of prcomp() or princomp()
  • choices: which PCs to plot (default c(1, 2))
  • scale: 1 for covariance biplot, 0 for form biplot; affects how observations and variables are scaled
  • obs.scale: scaling applied to the observation scores (samples)
  • var.scale: scaling applied to variable loadings (arrows)
  • pc.biplot: match biplot.princomp() conventions if TRUE
  • groups: factor (or vector coercible to factor) for coloring samples
  • ellipse: draw Normal probability ellipses per group when TRUE
  • ellipse.prob: probability mass for ellipses (default 0.68)
  • labels: optional character vector for sample labels
  • labels.size: size of sample labels
  • alpha: point transparency for samples (0 transparent, 1 opaque)
  • var.axes: draw variable loading vectors (arrows) when TRUE
  • circle: draw the unit correlation circle; meaningful when prcomp(scale.=TRUE) and var.scale = 1
  • circle.prob: probability used to compute the circle radius (default ≈ 0.69)
  • varname.size: text size for variable names
  • varname.adjust: nudges labels away from arrow tips (≥ 1 pushes labels farther)
  • varname.abbrev: abbreviate varible names when TRUE
  • ...: additional ggplot2 layer arguments (e.g., color scales, themes)
Tags: RPCA

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.