Implementing Cat and Dog Image Classification with PyTorch
The task of distinguishing cats from dogs originates from a beginner-level Kaggle competition titled Dogs vs Cats. To gain deeper insights into Convolutional Neural Networks (CNNs), several classic models like LeNet, AlexNet, and ResNet were implemented using PyTorch. This exploration investigates how different factors—such as network architecture, dataset size, data augmentation, and dropout—affect prediction accuracy. The source code is available on GitHub.
Problem Statement
Train a model on a labeled dataset to predict whether an image contains a cat or a dog. The training set comprises 25,000 images, and the test set has 12,500 images. The dataset can be downloaded from the official Kaggle repository.
Data Preprocessing
Cleaning Damaged Images
In 01_clean.py, various methods are used to detect corrupted images:
- Checking for JFIF headers at the start of files.
- Using imghdr.what() to identify file types.
- Verifying image integrity with Image.open().verify().
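The contents of 01_clean.py are not shown; a minimal sketch of the three checks might look like the following (is_valid_image is a hypothetical helper name, and the fallback handles imghdr's removal in Python 3.13):

```python
from pathlib import Path

try:
    import imghdr  # deprecated stdlib module; removed in Python 3.13
except ImportError:
    imghdr = None

from PIL import Image


def is_valid_image(path):
    """Apply the three corruption checks to one file."""
    raw = Path(path).read_bytes()
    # 1. A well-formed JPEG written by most tools carries a JFIF marker near the start
    if b"JFIF" not in raw[:32]:
        return False
    # 2. imghdr should recognise the file as a JPEG (skipped if imghdr is unavailable)
    if imghdr is not None and imghdr.what(path) != "jpeg":
        return False
    # 3. PIL's verify() raises on truncated or corrupted image data
    try:
        Image.open(path).verify()
    except Exception:
        return False
    return True
```

Files failing any check would simply be deleted or skipped before training.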
Dataset Construction
To manage over 10,000 images efficiently, a script (02_data_processing.py) copies a specified number of images into a train directory and renames them systematically to facilitate label assignment for each image.
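02_data_processing.py itself is not reproduced; a sketch of the copy-and-rename step, assuming Kaggle's cat.0.jpg / dog.0.jpg naming convention (build_train_dir is a hypothetical function name):

```python
import shutil
from pathlib import Path


def build_train_dir(src, dst, num_per_class):
    """Copy the first num_per_class cat and dog images from src into dst.

    Kaggle names files cat.0.jpg ... dog.12499.jpg, so the class label
    can later be parsed straight from the filename.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for cls in ("cat", "dog"):
        for i in range(num_per_class):
            f = src / f"{cls}.{i}.jpg"
            if f.exists():
                shutil.copy(f, dst / f.name)  # keep the label-bearing name
                copied += 1
    return copied
```

Keeping the class name in the filename lets the Dataset class derive labels without a separate annotation file.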
Image Transformation Pipeline
The preprocessing pipeline includes:
- Cropping images to a fixed size (224x224).
- Converting images to tensors.
- Normalizing pixel values across RGB channels.
- Applying data augmentation techniques.
- Creating a DataLoader via PyTorch's Dataset class.
A custom dataset class Mydata, defined in dataset.py, inherits from torch.utils.data.Dataset and implements three essential methods:
(1) Initialization
Loads image paths and splits data into training and validation sets:
import os

import torch
from PIL import Image
from torchvision import transforms as T  # aliased to avoid shadowing by the argument below

class Mydata(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None, train=True):
        # Collect all image paths under root, then split 80/20 into train/val
        imgs = [os.path.join(root, img) for img in sorted(os.listdir(root))]
        imgs_num = len(imgs)
        if train:
            self.imgs = imgs[:int(0.8 * imgs_num)]
        else:
            self.imgs = imgs[int(0.8 * imgs_num):]
        if transforms is None:
            normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])
            self.transforms = T.Compose([
                T.CenterCrop(224),
                T.Resize((224, 224)),
                T.ToTensor(),
                normalize
            ])
        else:
            self.transforms = transforms
(2) Get Item Method
Processes an image and returns its tensor and corresponding label:
    def __getitem__(self, index):
        img_path = self.imgs[index]
        # Label parsed from the filename (assumed cat.*.jpg / dog.*.jpg naming): dog -> 1, cat -> 0
        label = 1 if "dog" in os.path.basename(img_path) else 0
        data = Image.open(img_path).convert("RGB")
        data = self.transforms(data)
        return data, label
(3) Length Method
Returns the total count of images in the dataset:
def __len__(self):
return len(self.imgs)
(4) Testing
After instantiating the dataset, you can retrieve a processed image using __getitem__():
if __name__ == "__main__":
    root = "./data/train"
    train = Mydata(root, train=True)
    img, label = train.__getitem__(5)
    print(img.dtype)
    print(img.size(), label)
    print(len(train))

# Output:
# torch.float32
# torch.Size([3, 224, 224]) 0
# 3200
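With __len__ and __getitem__ in place, the dataset plugs straight into PyTorch's DataLoader for shuffled mini-batching. A minimal sketch using a stand-in TensorDataset (Mydata would be passed in exactly the same way):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for Mydata: 8 random "images" with binary labels
fake = TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))
loader = DataLoader(fake, batch_size=4, shuffle=True)

for imgs, labels in loader:
    # each batch: imgs is [4, 3, 224, 224], labels is [4]
    print(imgs.shape, labels.shape)
```

batch_size and shuffle are the two parameters most worth tuning; shuffle=True matters here because the copied files are grouped by class.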
Model Architecture
All models are defined in models.py, including LeNet, AlexNet, ResNet, and SqueezeNet. Here are key implementations:
LeNet Model
Designed originally for handwritten digit recognition, LeNet is adapted here with:
- Three convolutional layers.
- Three fully connected layers.
- ReLU activation functions.
- Batch normalization after convolutions.
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.relu = nn.ReLU()
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        # With 224x224 input, the three conv+pool stages leave a 3x3x64 feature map
        self.fc1 = nn.Linear(3 * 3 * 64, 64)
        self.fc2 = nn.Linear(64, 10)
        self.out = nn.Linear(10, 2)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = x.view(x.shape[0], -1)  # flatten for the fully connected layers
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.out(x)
        return x
AlexNet Model
Introduced in 2012, AlexNet features:
- Eight layers including five convolutional and three fully connected.
- ReLU activation functions.
- Dropout for regularization.
- Local Response Normalization (LRN).
- Overlapping max pooling.
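The article's AlexNet implementation in models.py is not reproduced; a sketch consistent with the listed features (five conv layers, three FC layers, ReLU, dropout, LRN, overlapping 3x3/stride-2 pooling), adapted to two output classes, might be:

```python
import torch.nn as nn


class AlexNet(nn.Module):
    """AlexNet-style network for 224x224 RGB input and 2-class output."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5),                # LRN, as in the 2012 paper
            nn.MaxPool2d(kernel_size=3, stride=2),  # overlapping pooling
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)          # -> [N, 256, 6, 6]
        return self.classifier(x.flatten(1))
```

The exact channel counts and dropout placement in the repository may differ; this follows the original paper's layout.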
Training Process
Training logic resides in main.py, which handles:
- Specifying model, epochs, and hyperparameters.
- Resuming training from checkpoints.
- Saving best and latest models.
- Evaluating performance metrics.
- Visualizing training progress with TensorBoard.
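main.py itself is not shown; the core of its epoch loop might be sketched as follows (train_one_epoch is a hypothetical name, and the metric bookkeeping is an assumption):

```python
import torch


def train_one_epoch(model, loader, criterion, optimizer, device="cpu"):
    """One pass over the training set; returns (mean loss, accuracy)."""
    model.train()
    total, correct, running_loss = 0, 0, 0.0
    for imgs, labels in loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        out = model(imgs)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * imgs.size(0)
        correct += (out.argmax(dim=1) == labels).sum().item()
        total += imgs.size(0)
    return running_loss / total, correct / total
```

An analogous loop under torch.no_grad() would compute the validation metrics used to decide whether to save a new best model.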
Initiating Training
Execute the script:
python3 main.py
If interrupted, resume training by setting resume=True.
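The checkpoint format used by main.py is not shown; a common pattern, sketched here with hypothetical helper names and dictionary keys, stores the epoch counter alongside both state dicts so training can pick up where it stopped:

```python
import torch


def save_checkpoint(model, optimizer, epoch, path):
    # Store everything needed to resume mid-run
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optim_state": optimizer.state_dict()}, path)


def load_checkpoint(model, optimizer, path):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optim_state"])
    return ckpt["epoch"] + 1  # epoch to resume from
```

Restoring the optimizer state matters for SGD with momentum: resuming with a fresh optimizer would reset the momentum buffers and perturb training.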
TensorBoard Visualization
Start TensorBoard:
tensorboard --logdir runs
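On the logging side, main.py presumably writes scalars with torch.utils.tensorboard; a minimal sketch (the run name and tag strings here are assumptions, not the repository's actual ones):

```python
from torch.utils.tensorboard import SummaryWriter

# Each run gets its own subdirectory under runs/, which is what
# `tensorboard --logdir runs` picks up
writer = SummaryWriter("runs/demo")
for epoch, (train_loss, val_acc) in enumerate([(0.69, 0.55), (0.60, 0.63)]):
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Accuracy/val", val_acc, epoch)
writer.close()
```

TensorBoard then plots each tag as a curve over epochs, which is how the accuracy and loss comparisons below were produced.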
Comparative Analysis of Model Performance
LeNet Experiments
No Augmentation, Small Dataset (1000 samples)
- Accuracy stabilizes around 63% after ~30 epochs.
- Validation loss increases despite decreasing training loss.
- Indicates overfitting due to limited data.
Larger Dataset Without Augmentation (4000 samples)
- Accuracy improves to ~68%, showing benefit of more data.
With Data Augmentation (4000 samples)
- Horizontal flip (p=0.5), vertical flip (p=0.1).
- Accuracy reaches ~71%.
Stronger Augmentation (4000 samples)
- Horizontal flip (p=0.5), vertical flip (p=0.5), brightness adjustment.
- Accuracy peaks at ~75% before slight decline.
Adding Dropout Regularization
- Applied after first FC layer.
- Maintains stable validation loss.
- Achieves ~76% final accuracy without overfitting.
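In code, this amounts to inserting an nn.Dropout between fc1 and fc2 of the LeNet variant; a sketch of the modified classifier head (p=0.5 is an assumed rate, as the article does not state one):

```python
import torch.nn as nn

# Classifier head of the LeNet variant, with dropout after the first FC layer
head = nn.Sequential(
    nn.Linear(3 * 3 * 64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes activations during training only
    nn.Linear(64, 10),
    nn.ReLU(),
    nn.Linear(10, 2),
)
```

Dropout is active only in model.train() mode; model.eval() disables it, which is why validation loss stays stable while the effective training network keeps changing.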
AlexNet
- More parameters than LeNet.
- Requires batch normalization, SGD optimizer, and careful learning rate tuning.
- Final accuracy reaches ~78%.
SqueezeNet
- Utilizes transfer learning.
- After 16 epochs, achieves ~93% accuracy.
ResNet
- Uses pre-trained ResNet50.
- After 25 epochs, achieves ~98% accuracy.
Prediction
Once trained, predictions are made using predict.py:
import torch

model = LeNet1()
modelpath = "./runs/LeNet1_1/LeNet1_best.pth"
checkpoint = torch.load(modelpath)
model.load_state_dict(checkpoint)
model.eval()  # switch to inference mode before predicting
root = "test_pics"
Predicted images are saved in an output folder along with predicted classes and confidence scores.
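The confidence score is typically the softmax probability of the predicted class; a sketch of the per-image step (predict_one is a hypothetical name, and the cat/dog class order follows the label convention of cat = 0, dog = 1 assumed earlier):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def predict_one(model, img_tensor, classes=("cat", "dog")):
    """Return (class name, confidence) for one preprocessed image tensor."""
    model.eval()
    logits = model(img_tensor.unsqueeze(0))  # add the batch dimension
    probs = F.softmax(logits, dim=1)[0]      # convert logits to probabilities
    conf, idx = probs.max(dim=0)
    return classes[idx.item()], conf.item()
```

predict.py would run this for every image under test_pics, then write each image into the output folder annotated with its class and confidence.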