Convolutional Neural Networks with PyTorch
6.1 From Fully Connected to Convolutional
Multilayer perceptrons are suitable for tabular data, but they scale poorly to high-dimensional perceptual data such as images.
6.1.1 Invariance
6.1.2 Limitations of Multilayer Perceptrons
6.1.3 Convolution
Convolution measures the overlap between two functions f and g when one of them is flipped and shifted by x: (f * g)(x) = ∫ f(z) g(x - z) dz. For discrete objects, the integral becomes a sum: (f * g)(i) = Σ_a f(a) g(i - a).
6.2 Image Convolution
6.2.1 Cross-Correlation Operation
Convolutional layers are misnamed because the operation they perform is actually cross-correlation, not convolution.
First, ignore the channel (third dimension) and see how to handle 2D image data and hidden representations. The shape of the convolution kernel window is determined by the kernel's height and width.
The output size is slightly smaller than the input size because the kernel's height and width are greater than 1. The output size is (input size - kernel size + 1) in both dimensions.
Next, implement this process in the corr2d function, which takes an input tensor X and a kernel tensor K and returns the output tensor Y.
import torch
from torch import nn
from d2l import torch as d2l
def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y
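As a sanity check, corr2d can be run on a small example (input and kernel values chosen here purely for illustration; corr2d is repeated so the snippet is self-contained):

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
print(corr2d(X, K))
# tensor([[19., 25.],
#         [37., 43.]])
```

The 3×3 input and 2×2 kernel produce a 2×2 output, matching the (input size - kernel size + 1) formula.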
6.2.2 Convolutional Layer
Implement a 2D convolutional layer based on the corr2d function defined above. In the __init__ constructor, declare weight and bias as model parameters. The forward propagation function calls corr2d and adds the bias.
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias
6.2.3 Edge Detection in Images
Here's a simple application of convolutional layers: detecting edges between different colors in an image by finding positions where pixel values change. First, construct a 6×8 pixel black-and-white image. The middle four columns are black (0), and the rest are white (1).
X = torch.ones((6, 8))
X[:, 2:6] = 0
"""
tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
[1., 1., 0., 0., 0., 0., 1., 1.],
[1., 1., 0., 0., 0., 0., 1., 1.],
[1., 1., 0., 0., 0., 0., 1., 1.],
[1., 1., 0., 0., 0., 0., 1., 1.],
[1., 1., 0., 0., 0., 0., 1., 1.]])
"""
Next, construct a convolution kernel K of height 1 and width 2. During cross-correlation, if two horizontally adjacent elements are the same, the output is zero; otherwise, it's non-zero.
K = torch.tensor([[1.0, -1.0]])
Now, perform cross-correlation on X (input) and K (kernel). The output Y shows 1 for edges from white to black, -1 for edges from black to white, and 0 otherwise.
Y = corr2d(X, K)
"""
tensor([[ 0., 1., 0., 0., 0., -1., 0.],
[ 0., 1., 0., 0., 0., -1., 0.],
[ 0., 1., 0., 0., 0., -1., 0.],
[ 0., 1., 0., 0., 0., -1., 0.],
[ 0., 1., 0., 0., 0., -1., 0.],
[ 0., 1., 0., 0., 0., -1., 0.]])
"""
6.2.4 Learning Convolution Kernels
Construct a convolutional layer with a randomly initialized kernel. Then, in each iteration, compare the output Y_hat with the target Y using squared error, compute gradients, and update the kernel. For simplicity, use PyTorch's built-in 2D convolutional layer and ignore the bias.
# Construct a 2D convolutional layer with 1 output channel and kernel size (1, 2), no bias
conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
# Use 4D input/output format (batch size, channels, height, width) with batch size and channels both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate
for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')
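After training, the learned kernel should approach the [1, -1] edge detector used above. A self-contained version of the loop (the exact values vary with the random initialization):

```python
import torch
from torch import nn

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])
# Target produced by the known edge-detecting kernel
Y = torch.zeros((6, 7))
for j in range(7):
    Y[:, j] = (X[:, j:j + 2] * K).sum(dim=1)

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
X4, Y4 = X.reshape((1, 1, 6, 8)), Y.reshape((1, 1, 6, 7))
lr = 3e-2
for i in range(10):
    l = ((conv2d(X4) - Y4) ** 2).sum()
    conv2d.zero_grad()
    l.backward()
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
print(conv2d.weight.data.reshape((1, 2)))  # approaches tensor([[ 1., -1.]])
```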
6.2.5 Cross-Correlation vs. Convolution
Since convolution kernels are learned from data, the outputs of convolutional layers are unaffected by whether they perform strict convolution or cross-correlation: a kernel learned under one operation is simply the flipped version of the kernel that would be learned under the other.
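This relationship can be checked directly: strict convolution equals cross-correlation with the kernel flipped along both spatial axes, so for a flip-symmetric kernel the two operations coincide (an illustrative check, with kernel values chosen for symmetry):

```python
import torch
import torch.nn.functional as F

X = torch.rand(1, 1, 5, 5)
K = torch.tensor([[1.0, 2.0, 1.0],
                  [2.0, 4.0, 2.0],
                  [1.0, 2.0, 1.0]]).reshape(1, 1, 3, 3)

corr = F.conv2d(X, K)                           # cross-correlation
conv = F.conv2d(X, torch.flip(K, dims=[2, 3]))  # strict convolution
print(torch.allclose(corr, conv))  # True: K is unchanged by flipping
```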
6.2.6 Feature Maps and Receptive Fields
The output of a convolutional layer is sometimes called a feature map, because it can be regarded as the learned representations (features) in the spatial dimensions that are passed to the next layer. In a CNN, the receptive field of an element x in some layer refers to all elements, from all previous layers, that may affect the computation of x during forward propagation.
The receptive field may be larger than the actual size of the input, and stacking more layers enlarges it.
6.3 Padding and Stride
After applying many consecutive convolutional layers, the output may end up much smaller than the input, because every kernel larger than 1×1 shrinks the output.
6.3.1 Padding
Adding ph rows of padding in total (roughly half on top, half on bottom) and pw columns in total (half on the left, half on the right) changes the output shape to (H - kh + ph + 1) × (W - kw + pw + 1). Often, ph = kh - 1 and pw = kw - 1 are chosen so that the input and output have the same height and width.
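For example, with a 3×3 kernel, one row/column of padding on every side gives ph = pw = 2 in total, so the output keeps the input's height and width (note that nn.Conv2d's padding argument counts pixels added per side):

```python
import torch
from torch import nn

conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(1, 1, 8, 8)
print(conv2d(X).shape)  # torch.Size([1, 1, 8, 8]): 8 - 3 + 2 + 1 = 8
```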
6.3.2 Stride
When using a vertical stride of sh and a horizontal stride of sw, the output shape is floor((H - kh + ph + sh)/sh) × floor((W - kw + pw + sw)/sw), which reduces to the padding-only formula when sh = sw = 1.
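A stride of 2 roughly halves each spatial dimension. Continuing the padded example above:

```python
import torch
from torch import nn

conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
X = torch.rand(1, 1, 8, 8)
# floor((8 - 3 + 2 + 2) / 2) = 4 in each dimension
print(conv2d(X).shape)  # torch.Size([1, 1, 4, 4])
```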
6.4 Multiple Input and Output Channels
Color images have standard RGB channels. Input and hidden representations become 3D tensors with shape (channels, height, width). For example, RGB images have shape (3, H, W).
6.4.1 Multiple Input Channels
For multi-channel inputs, the convolution kernel must have the same number of input channels. Each input channel has a (kh × kw) kernel, and all channels are concatenated into a (ci, kh, kw) kernel.
import torch
from d2l import torch as d2l
def corr2d_multi_in(X, K):
    # Iterate over the channel dimension of X and K, then sum the results
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))
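A worked example with a two-channel input (values chosen for illustration; corr2d is inlined here so the snippet runs without d2l):

```python
import torch

def corr2d(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    # Cross-correlate each channel pair, then sum over channels
    return sum(corr2d(x, k) for x, k in zip(X, K))

X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
print(corr2d_multi_in(X, K))
# tensor([[ 56.,  72.],
#         [104., 120.]])
```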
6.4.2 Multiple Output Channels
As the network deepens, the number of output channels is typically increased so that each layer can extract a more diverse set of features. A kernel with ci input channels and co output channels has shape (co, ci, kh, kw).
def corr2d_multi_in_out(X, K):
    # Iterate over the output channels of K, perform multi-input
    # cross-correlation with the full input, and stack the results
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)
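The result can be cross-checked against PyTorch's built-in F.conv2d, which also performs cross-correlation; a sketch with random data (helper functions inlined so the snippet is self-contained):

```python
import torch
import torch.nn.functional as F

def corr2d(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    return sum(corr2d(x, k) for x, k in zip(X, K))

def corr2d_multi_in_out(X, K):
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

X = torch.rand(2, 5, 5)
K = torch.rand(3, 2, 2, 2)  # (c_o, c_i, kh, kw)
Y = corr2d_multi_in_out(X, K)
print(Y.shape)  # torch.Size([3, 4, 4])

# Built-in reference: add/remove the batch dimension
Y_ref = F.conv2d(X.unsqueeze(0), K).squeeze(0)
print(torch.allclose(Y, Y_ref, atol=1e-5))  # True
```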
6.4.3 1×1 Convolution Layers
Because a 1×1 window cannot see neighboring pixels, 1×1 convolutions lose the ability to detect spatial interactions. Instead, a 1×1 convolutional layer acts as a fully connected layer applied at every pixel, linearly combining the ci input channels into co output channels.
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication over the channel dimension
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))
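The matrix-multiplication view can be verified against a real 1×1 convolution (an illustrative check with random data):

```python
import torch
import torch.nn.functional as F

def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    Y = torch.matmul(K, X)  # fully connected layer applied per pixel
    return Y.reshape((c_o, h, w))

X = torch.rand(3, 4, 4)
K = torch.rand(2, 3, 1, 1)
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = F.conv2d(X.unsqueeze(0), K).squeeze(0)  # genuine 1x1 convolution
print(torch.allclose(Y1, Y2, atol=1e-5))  # True
```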
6.5 Pooling Layers
Pooling layers downsample feature maps, which reduces the network's sensitivity to the exact location of features and lowers the spatial resolution (and hence the computation) passed on to subsequent layers.
6.5.1 Max and Average Pooling
Like convolutional layers, pooling operators slide a fixed-shape window over the input. Unlike convolutional layers, pooling layers have no learnable parameters: the window simply takes the maximum (max pooling) or the average (average pooling) of the elements it covers.
import torch
from torch import nn
from d2l import torch as d2l
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y
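A quick check of both modes on a small input (values chosen for illustration; pool2d repeated so the snippet is self-contained):

```python
import torch

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

X = torch.arange(9, dtype=torch.float32).reshape(3, 3)
print(pool2d(X, (2, 2)))         # tensor([[4., 5.], [7., 8.]])
print(pool2d(X, (2, 2), 'avg'))  # tensor([[2., 3.], [5., 6.]])
```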
6.5.2 Padding and Stride
Use 4D input format (batch, channels, height, width).
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
# Default stride equals pool size
pool2d = nn.MaxPool2d(3)
print(pool2d(X))
# Output: tensor([[[[10.]]]])
# Custom pool size, stride, and padding
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
print(pool2d(X))
6.5.3 Multiple Channels
Pooling operates independently on each input channel, so output channels equal input channels.
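This can be seen by pooling a two-channel input built by concatenation (an illustrative sketch):

```python
import torch
from torch import nn

X = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
X = torch.cat((X, X + 1), 1)  # two input channels
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
print(pool2d(X).shape)  # torch.Size([1, 2, 2, 2]): channel count unchanged
```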
6.6 LeNet
Replacing fully connected layers with convolutional layers keeps the model simpler, requiring far fewer parameters while preserving the spatial structure of the image.
6.6.1 LeNet Architecture
import torch
from torch import nn
from d2l import torch as d2l
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10))
# Check the output shape of each layer
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape: ', X.shape)
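To make the parameter savings concrete, the whole network's parameter count can be tallied (a quick sketch; for comparison, a single fully connected layer mapping the 784-pixel input to the first 6×28×28 hidden representation would alone need roughly 3.7 million weights, versus 156 for the first convolutional layer):

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10))

# Sum parameter counts over all layers
total = sum(p.numel() for p in net.parameters())
print(total)  # 61706: only 156 of these come from the first conv layer
```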
6.6.2 Model Training
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Compute model accuracy on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval()  # Set to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # Number of correct predictions, number of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
    """Train a model using a GPU."""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of correct predictions, number of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')
lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())