Implementing Cat and Dog Image Classification with PyTorch
The task of distinguishing cats from dogs originates from a beginner-level Kaggle competition titled Dogs vs Cats. To gain deeper insights into Convolutional Neural Networks (CNNs), several classic models like LeNet, AlexNet, and ResNet were implemented using PyTorch. This exploration investigates how different factors—such as network architecture, dataset size, data augmentation, and dropout—affect prediction accuracy. The source code is available on GitHub.
Problem Statement
Train a model on a labeled dataset to predict whether an image contains a cat or a dog. The training set comprises 25,000 images, and the test set has 12,500 images. The dataset can be downloaded from the official Kaggle repository.
Data Preprocessing
Cleaning Damaged Images
In 01_clean.py, various methods are used to detect corrupted images:
- Checking for JFIF headers at the start of files.
- Using imghdr.what() to identify file types.
- Verifying image integrity with Image.open().verify().
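The contents of 01_clean.py are not shown; a minimal sketch of the three checks might look like the following (is_valid_image is a hypothetical helper name, and the fallback handles imghdr's removal in Python 3.13):

```python
from pathlib import Path

try:
    import imghdr  # deprecated stdlib module; removed in Python 3.13
except ImportError:
    imghdr = None

from PIL import Image


def is_valid_image(path):
    """Apply the three corruption checks to one file."""
    raw = Path(path).read_bytes()
    # 1. A well-formed JPEG written by most tools carries a JFIF marker near the start
    if b"JFIF" not in raw[:32]:
        return False
    # 2. imghdr should recognise the file as a JPEG (skipped if imghdr is unavailable)
    if imghdr is not None and imghdr.what(path) != "jpeg":
        return False
    # 3. PIL's verify() raises on truncated or corrupted image data
    try:
        Image.open(path).verify()
    except Exception:
        return False
    return True
```

Files failing any check would simply be deleted or skipped before training.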
Dataset Construction
To manage over 10,000 images efficiently, a script (02_data_processing.py) copies a specified number of images into a train directory and renames them systematically to facilitate label assignment for each image.
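02_data_processing.py itself is not reproduced; a sketch of the copy-and-rename step, assuming Kaggle's cat.0.jpg / dog.0.jpg naming convention (build_train_dir is a hypothetical function name):

```python
import shutil
from pathlib import Path


def build_train_dir(src, dst, num_per_class):
    """Copy the first num_per_class cat and dog images from src into dst.

    Kaggle names files cat.0.jpg ... dog.12499.jpg, so the class label
    can later be parsed straight from the filename.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for cls in ("cat", "dog"):
        for i in range(num_per_class):
            f = src / f"{cls}.{i}.jpg"
            if f.exists():
                shutil.copy(f, dst / f.name)  # keep the label-bearing name
                copied += 1
    return copied
```

Keeping the class name in the filename lets the Dataset class derive labels without a separate annotation file.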
Image Transformation Pipeline
The preprocessing pipeline includes:
- Cropping images to a fixed size (224x224).
- Converting images to tensors.
- Normalizing pixel values across RGB channels.
- Applying data augmentation techniques.
- Creating a DataLoader via PyTorch's Dataset class.
A custom dataset class Mydata, defined in dataset.py, inherits from torch.utils.data.Dataset and implements three essential methods:
(1) Initialization
Loads image paths and splits data into training and validation sets:
import os

import torch
from PIL import Image
from torchvision import transforms as T  # aliased to avoid shadowing by the argument below

class Mydata(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None, train=True):
        # Collect all image paths under root, then split 80/20 into train/val
        imgs = [os.path.join(root, img) for img in sorted(os.listdir(root))]
        imgs_num = len(imgs)
        if train:
            self.imgs = imgs[:int(0.8 * imgs_num)]
        else:
            self.imgs = imgs[int(0.8 * imgs_num):]
        if transforms is None:
            normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])
            self.transforms = T.Compose([
                T.CenterCrop(224),
                T.Resize((224, 224)),
                T.ToTensor(),
                normalize
            ])
        else:
            self.transforms = transforms
(2) Get Item Method
Processes an image and returns its tensor and corresponding label:
    def __getitem__(self, index):
        img_path = self.imgs[index]
        # Label parsed from the filename (assumed cat.*.jpg / dog.*.jpg naming): dog -> 1, cat -> 0
        label = 1 if "dog" in os.path.basename(img_path) else 0
        data = Image.open(img_path).convert("RGB")
        data = self.transforms(data)
        return data, label
(3) Length Method
Returns the total count of images in the dataset:
def __len__(self):
return len(self.imgs)
(4) Testing
After instantiating the dataset, you can retrieve a processed image using __getitem__():
if __name__ == "__main__":
    root = "./data/train"
    train = Mydata(root, train=True)
    img, label = train.__getitem__(5)
    print(img.dtype)
    print(img.size(), label)
    print(len(train))

# Output:
# torch.float32
# torch.Size([3, 224, 224]) 0
# 3200
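With __len__ and __getitem__ in place, the dataset plugs straight into PyTorch's DataLoader for shuffled mini-batching. A minimal sketch using a stand-in TensorDataset (Mydata would be passed in exactly the same way):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for Mydata: 8 random "images" with binary labels
fake = TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))
loader = DataLoader(fake, batch_size=4, shuffle=True)

for imgs, labels in loader:
    # each batch: imgs is [4, 3, 224, 224], labels is [4]
    print(imgs.shape, labels.shape)
```

batch_size and shuffle are the two parameters most worth tuning; shuffle=True matters here because the copied files are grouped by class.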
Model Architecture
All models are defined in models.py, including LeNet, AlexNet, ResNet, and SqueezeNet. Here are key implementations:
LeNet Model
Designed originally for handwritten digit recognition, LeNet is adapted here with:
- Three convolutional layers.
- Three fully connected layers.
- ReLU activation functions.
- Batch normalization after convolutions.
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.relu = nn.ReLU()
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        # With 224x224 input, the three conv+pool stages leave a 3x3x64 feature map
        self.fc1 = nn.Linear(3 * 3 * 64, 64)
        self.fc2 = nn.Linear(64, 10)
        self.out = nn.Linear(10, 2)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = x.view(x.shape[0], -1)  # flatten for the fully connected layers
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.out(x)
        return x
AlexNet Model
Introduced in 2012, AlexNet features:
- Eight layers including five convolutional and three fully connected.
- ReLU activation functions.
- Dropout for regularization.
- Local Response Normalization (LRN).
- Overlapping max pooling.
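The article's AlexNet implementation in models.py is not reproduced; a sketch consistent with the listed features (five conv layers, three FC layers, ReLU, dropout, LRN, overlapping 3x3/stride-2 pooling), adapted to two output classes, might be:

```python
import torch.nn as nn


class AlexNet(nn.Module):
    """AlexNet-style network for 224x224 RGB input and 2-class output."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5),                # LRN, as in the 2012 paper
            nn.MaxPool2d(kernel_size=3, stride=2),  # overlapping pooling
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)          # -> [N, 256, 6, 6]
        return self.classifier(x.flatten(1))
```

The exact channel counts and dropout placement in the repository may differ; this follows the original paper's layout.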
Training Process
Training logic resides in main.py, which handles:
- Specifying model, epochs, and hyperparameters.
- Resuming training from checkpoints.
- Saving best and latest models.
- Evaluating performance metrics.
- Visualizing training progress with TensorBoard.
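main.py itself is not shown; the core of its epoch loop might be sketched as follows (train_one_epoch is a hypothetical name, and the metric bookkeeping is an assumption):

```python
import torch


def train_one_epoch(model, loader, criterion, optimizer, device="cpu"):
    """One pass over the training set; returns (mean loss, accuracy)."""
    model.train()
    total, correct, running_loss = 0, 0, 0.0
    for imgs, labels in loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        out = model(imgs)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * imgs.size(0)
        correct += (out.argmax(dim=1) == labels).sum().item()
        total += imgs.size(0)
    return running_loss / total, correct / total
```

An analogous loop under torch.no_grad() would compute the validation metrics used to decide whether to save a new best model.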
Initiating Training
Execute the script:
python3 main.py
If interrupted, resume training by setting resume=True.
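The checkpoint format used by main.py is not shown; a common pattern, sketched here with hypothetical helper names and dictionary keys, stores the epoch counter alongside both state dicts so training can pick up where it stopped:

```python
import torch


def save_checkpoint(model, optimizer, epoch, path):
    # Store everything needed to resume mid-run
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optim_state": optimizer.state_dict()}, path)


def load_checkpoint(model, optimizer, path):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optim_state"])
    return ckpt["epoch"] + 1  # epoch to resume from
```

Restoring the optimizer state matters for SGD with momentum: resuming with a fresh optimizer would reset the momentum buffers and perturb training.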
TensorBoard Visualization
Start TensorBoard:
tensorboard --logdir runs
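On the logging side, main.py presumably writes scalars with torch.utils.tensorboard; a minimal sketch (the run name and tag strings here are assumptions, not the repository's actual ones):

```python
from torch.utils.tensorboard import SummaryWriter

# Each run gets its own subdirectory under runs/, which is what
# `tensorboard --logdir runs` picks up
writer = SummaryWriter("runs/demo")
for epoch, (train_loss, val_acc) in enumerate([(0.69, 0.55), (0.60, 0.63)]):
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Accuracy/val", val_acc, epoch)
writer.close()
```

TensorBoard then plots each tag as a curve over epochs, which is how the accuracy and loss comparisons below were produced.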
Comparative Analysis of Model Performance
LeNet Experiments
No Augmentation, Small Dataset (1000 samples)
- Accuracy stabilizes around 63% after ~30 epochs.
- Validation loss increases despite decreasing training loss.
- Indicates overfitting due to limited data.
Larger Dataset Without Augmentation (4000 samples)
- Accuracy improves to ~68%, showing benefit of more data.
With Data Augmentation (4000 samples)
- Horizontal flip (p=0.5), vertical flip (p=0.1).
- Accuracy reaches ~71%.
Stronger Augmentation (4000 samples)
- Horizontal flip (p=0.5), vertical flip (p=0.5), brightness adjustment.
- Accuracy peaks at ~75% before slight decline.
Adding Dropout Regularization
- Applied after first FC layer.
- Maintains stable validation loss.
- Achieves ~76% final accuracy without overfitting.
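In code, this amounts to inserting an nn.Dropout between fc1 and fc2 of the LeNet variant; a sketch of the modified classifier head (p=0.5 is an assumed rate, as the article does not state one):

```python
import torch.nn as nn

# Classifier head of the LeNet variant, with dropout after the first FC layer
head = nn.Sequential(
    nn.Linear(3 * 3 * 64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes activations during training only
    nn.Linear(64, 10),
    nn.ReLU(),
    nn.Linear(10, 2),
)
```

Dropout is active only in model.train() mode; model.eval() disables it, which is why validation loss stays stable while the effective training network keeps changing.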
AlexNet
- More parameters than LeNet.
- Requires batch normalization, SGD optimizer, and careful learning rate tuning.
- Final accuracy reaches ~78%.
SqueezeNet
- Utilizes transfer learning.
- After 16 epochs, achieves ~93% accuracy.
ResNet
- Uses pre-trained ResNet50.
- After 25 epochs, achieves ~98% accuracy.
Prediction
Once trained, predictions are made using predict.py:
import torch

model = LeNet1()
modelpath = "./runs/LeNet1_1/LeNet1_best.pth"
checkpoint = torch.load(modelpath)
model.load_state_dict(checkpoint)
model.eval()  # switch to inference mode before predicting
root = "test_pics"
Predicted images are saved in an output folder along with predicted classes and confidence scores.
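The confidence score is typically the softmax probability of the predicted class; a sketch of the per-image step (predict_one is a hypothetical name, and the cat/dog class order follows the label convention of cat = 0, dog = 1 assumed earlier):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def predict_one(model, img_tensor, classes=("cat", "dog")):
    """Return (class name, confidence) for one preprocessed image tensor."""
    model.eval()
    logits = model(img_tensor.unsqueeze(0))  # add the batch dimension
    probs = F.softmax(logits, dim=1)[0]      # convert logits to probabilities
    conf, idx = probs.max(dim=0)
    return classes[idx.item()], conf.item()
```

predict.py would run this for every image under test_pics, then write each image into the output folder annotated with its class and confidence.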