GoogLeNet Architecture with Parallel Connections
GoogLeNet (Szegedy et al., 2015) won the ImageNet competition in 2014. It builds on the Network in Network (NiN) idea and addresses the question of which convolution kernel size works best. Earlier networks used kernels ranging from (1\times1) to (11\times11); this work showed that combining kernels of several sizes within one block can be advantageous. The implementation below is a simplified version of GoogLeNet that omits some features originally added to stabilize training, which are less necessary with modern training techniques.
Inception Modules
The fundamental building block in GoogLeNet is the Inception module, likely named after the film "Inception" and its theme of going deeper into layers. As illustrated in architectural diagrams, an Inception module consists of four parallel processing paths. The first three paths employ convolution layers with kernel sizes of (1\times1), (3\times3), and (5\times5) respectively to capture features at different spatial scales. The middle two paths incorporate (1\times1) convolutions before their larger convolutions to reduce channel dimensions and computational overhead. The fourth path applies a (3\times3) max pooling operation followed by a (1\times1) convolution for channel adjustment. All paths maintain spatial consistency through appropriate padding, and their outputs are concatenated along the channel axis to form the module's final output. Hyperparameter tuning typically involves adjusting the number of output channels per pathway.
import torch
from torch import nn
from torch.nn import functional as F
class InceptionModule(nn.Module):
    def __init__(self, input_channels, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, pool_proj):
        super(InceptionModule, self).__init__()
        # Path 1: 1x1 convolution
        self.branch1 = nn.Conv2d(input_channels, out_1x1, kernel_size=1)
        # Path 2: 1x1 reduction followed by 3x3 convolution
        self.branch2_reduce = nn.Conv2d(input_channels, red_3x3, kernel_size=1)
        self.branch2_conv = nn.Conv2d(red_3x3, out_3x3, kernel_size=3, padding=1)
        # Path 3: 1x1 reduction followed by 5x5 convolution
        self.branch3_reduce = nn.Conv2d(input_channels, red_5x5, kernel_size=1)
        self.branch3_conv = nn.Conv2d(red_5x5, out_5x5, kernel_size=5, padding=2)
        # Path 4: 3x3 max pooling followed by 1x1 convolution
        self.branch4_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.branch4_conv = nn.Conv2d(input_channels, pool_proj, kernel_size=1)

    def forward(self, x):
        # Process each branch with ReLU activation
        out1 = F.relu(self.branch1(x))
        out2 = F.relu(self.branch2_conv(F.relu(self.branch2_reduce(x))))
        out3 = F.relu(self.branch3_conv(F.relu(self.branch3_reduce(x))))
        out4 = F.relu(self.branch4_conv(self.branch4_pool(x)))
        # Combine outputs along channel dimension
        return torch.cat([out1, out2, out3, out4], dim=1)
The effectiveness of GoogLeNet stems from its multi-scale filter approach, allowing simultaneous detection of both fine-grained and coarse features across different receptive fields. Additionally, it enables flexible parameter allocation across various filter types.
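To make the saving from the (1\times1) reduction layers concrete, the following sketch counts the weights and biases of the (5\times5) path with and without the reduction step. The helper `conv_params` is a hypothetical name introduced here for illustration, and the channel sizes (192 in, 16 after reduction, 32 out) are taken from the first Inception module used later in this section.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (hypothetical helper)."""
    return c_in * c_out * k * k + c_out

c_in, reduced, c_out = 192, 16, 32

# 5x5 convolution applied directly to all 192 input channels
direct = conv_params(c_in, c_out, 5)

# 1x1 reduction to 16 channels, then the 5x5 convolution
with_reduction = conv_params(c_in, reduced, 1) + conv_params(reduced, c_out, 5)

print(direct, with_reduction)
```

Under these assumptions the direct convolution needs 153,632 parameters and the reduced path 15,920, which is roughly a tenfold saving for the same number of output channels.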
Network Structure
GoogLeNet stacks nine Inception modules and ends with global average pooling to produce its predictions. Max pooling between groups of Inception modules reduces the spatial dimensions. The initial layers resemble those of AlexNet and LeNet, the stacking of blocks follows VGG, and global average pooling replaces the traditional fully connected layers at the end.
Implementation proceeds module by module. The first stage utilizes a (7\times7) convolution with 64 output channels:
stage1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
Stage two uses two convolutions: a (1\times1) layer with 64 channels followed by a (3\times3) convolution that triples the channel count to 192:
stage2 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 192, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
Stage three chains two complete Inception modules. The first outputs 256 channels, split 64:128:32:32 across the four paths (a 2:4:1:1 ratio). Its second and third paths first reduce the 192 input channels to 96 (1/2) and 16 (1/12) before the larger convolutions. The second module outputs 480 channels, split 128:192:96:64 (a 4:6:3:2 ratio), and its second and third paths reduce the 256 input channels to 128 (1/2) and 32 (1/8):
stage3 = nn.Sequential(
    InceptionModule(192, 64, 96, 128, 16, 32, 32),
    InceptionModule(256, 128, 128, 192, 32, 96, 64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
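The channel arithmetic quoted for stage three can be checked directly from the constructor arguments. The sketch below uses only the standard library; `summarize` is a hypothetical helper, and its argument order is assumed to match `InceptionModule` above.

```python
from fractions import Fraction

def summarize(c_in, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, pool_proj):
    """Return total output channels and the two reduction ratios."""
    total = out_1x1 + out_3x3 + out_5x5 + pool_proj
    return total, Fraction(red_3x3, c_in), Fraction(red_5x5, c_in)

first = summarize(192, 64, 96, 128, 16, 32, 32)
second = summarize(256, 128, 128, 192, 32, 96, 64)

# Each tuple: (total channels, 3x3-path reduction, 5x5-path reduction)
print(first)
print(second)
```

The totals come out to 256 and 480 channels, with reductions of 1/2 and 1/12 for the first module and 1/2 and 1/8 for the second.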
Stage four is more complex, with five Inception modules producing 512, 512, 512, 528, and 832 output channels. The channel split follows a similar pattern to stage three: the (3\times3) path produces the most channels in every module, followed by the (1\times1) path, while the (5\times5) and pooling paths receive the fewest. The reduction ratios vary slightly from module to module:
stage4 = nn.Sequential(
    InceptionModule(480, 192, 96, 208, 16, 48, 64),
    InceptionModule(512, 160, 112, 224, 24, 64, 64),
    InceptionModule(512, 128, 128, 256, 24, 64, 64),
    InceptionModule(512, 112, 144, 288, 32, 64, 64),
    InceptionModule(528, 256, 160, 320, 32, 128, 128),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
Stage five contains two Inception modules with 832 and 1024 output channels. The split across paths follows the same pattern as in stages three and four, with different absolute values. As in NiN, global average pooling then reduces each channel's height and width to 1, and the result is flattened before the final classification layer:
stage5 = nn.Sequential(
    InceptionModule(832, 256, 160, 320, 32, 128, 128),
    InceptionModule(832, 384, 192, 384, 48, 128, 128),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten()
)
model = nn.Sequential(stage1, stage2, stage3, stage4, stage5, nn.Linear(1024, 10))
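Because every Inception module must receive exactly as many channels as the previous one emits, a quick consistency check on the nine configurations can catch typos before the model is built. The following sketch is plain Python; the list `configs` simply repeats the constructor arguments from stages three to five, and `out_channels` is a hypothetical helper.

```python
# (input_channels, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, pool_proj)
configs = [
    (192, 64, 96, 128, 16, 32, 32),      # stage 3
    (256, 128, 128, 192, 32, 96, 64),
    (480, 192, 96, 208, 16, 48, 64),     # stage 4
    (512, 160, 112, 224, 24, 64, 64),
    (512, 128, 128, 256, 24, 64, 64),
    (512, 112, 144, 288, 32, 64, 64),
    (528, 256, 160, 320, 32, 128, 128),
    (832, 256, 160, 320, 32, 128, 128),  # stage 5
    (832, 384, 192, 384, 48, 128, 128),
]

def out_channels(cfg):
    """Sum the output channels of the four concatenated paths."""
    _, out_1x1, _, out_3x3, _, out_5x5, pool_proj = cfg
    return out_1x1 + out_3x3 + out_5x5 + pool_proj

outputs = [out_channels(cfg) for cfg in configs]

# Each module's input must equal the previous module's output
for prev_out, cfg in zip(outputs, configs[1:]):
    assert prev_out == cfg[0], (prev_out, cfg)

print(outputs)
```

The last value, 1024, is the number of input features expected by `nn.Linear(1024, 10)`.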
GoogLeNet is computationally expensive, and its many per-path channel counts are harder to adjust than the uniform blocks of VGG. To keep training time on Fashion-MNIST reasonable, the input resolution is reduced from 224×224 to 96×96. The loop below prints the output shape after each stage:
test_input = torch.randn(1, 1, 96, 96)
for layer in model:
    test_input = layer(test_input)
    print(f'{layer.__class__.__name__}: {test_input.shape}')
The spatial size shrinks from 96×96 to 3×3 while the channel count grows to 1024:
Sequential: torch.Size([1, 64, 24, 24])
Sequential: torch.Size([1, 192, 12, 12])
Sequential: torch.Size([1, 480, 6, 6])
Sequential: torch.Size([1, 832, 3, 3])
Sequential: torch.Size([1, 1024])
Linear: torch.Size([1, 10])
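The spatial sizes in this output follow from the usual output-size formula for convolution and pooling layers with floor rounding, floor((n + 2p - k) / s) + 1. The sketch below applies it to the strided layers only, since the stride-1 layers with the padding used here leave height and width unchanged; `out_size` is a hypothetical helper, and floor rounding is assumed to match the default behavior of the layers above.

```python
def out_size(n, k, s, p):
    """Output length of a conv/pool layer with floor rounding."""
    return (n + 2 * p - k) // s + 1

n = 96
sizes = []
n = out_size(n, k=7, s=2, p=3)   # stage 1: 7x7 convolution, stride 2
n = out_size(n, k=3, s=2, p=1)   # stage 1: 3x3 max pooling, stride 2
sizes.append(n)
for _ in range(3):               # stages 2-4 each end with the same max pooling
    n = out_size(n, k=3, s=2, p=1)
    sizes.append(n)

print(sizes)
```

The resulting sizes, 24, 12, 6, and 3, match the heights and widths printed for the first four stages.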
Model Training
Training uses the Fashion-MNIST dataset with images resized to 96×96 pixels:
from d2l import torch as d2l
learning_rate, epochs, batch_size = 0.1, 10, 128
train_loader, test_loader = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(model, train_loader, test_loader, epochs, learning_rate, d2l.try_gpu())
In the reported run, the model reaches approximately 90.8% training accuracy and 88.7% test accuracy, with a throughput of around 1277 samples per second on a GPU.