Fading Coder

One Final Commit for the Last Sprint


Implementing Cat and Dog Image Classification with PyTorch


The task of distinguishing cats from dogs originates from a beginner-level Kaggle competition titled Dogs vs Cats. To gain deeper insights into Convolutional Neural Networks (CNNs), several classic models like LeNet, AlexNet, and ResNet were implemented using PyTorch. This exploration investigates how different factors—such as network architecture, dataset size, data augmentation, and dropout—affect prediction accuracy. The source code is available on GitHub.

Problem Statement

Train a model on a labeled dataset to predict whether an image contains a cat or a dog. The training set comprises 25,000 images, and the test set has 12,500 images. The dataset can be downloaded from the official Kaggle repository.

Data Preprocessing

Cleaning Damaged Images

In 01_clean.py, three checks are combined to detect corrupted images:

  1. Checking for JFIF headers at the start of files.
  2. Using imghdr.what() to identify file types.
  3. Verifying image integrity using Image.open().verify().
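The script itself is not reproduced in the article; a sketch combining the three checks might look like the following. Note that imghdr.what() is deprecated and removed in Python 3.13, so Pillow's own format detection stands in for it here:

```python
from PIL import Image

def is_valid_image(path):
    """Heuristic corruption check combining the three tests above."""
    # 1. JFIF header check: the marker bytes sit near the start of a JPEG file
    with open(path, "rb") as f:
        if b"JFIF" not in f.read(16):
            return False
    # 2/3. Pillow detects the format, and verify() raises on corrupt data
    try:
        with Image.open(path) as img:
            if img.format != "JPEG":
                return False
            img.verify()
    except Exception:
        return False
    return True
```

Any image failing the check can then simply be deleted before training.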

Dataset Construction

To manage over 10,000 images efficiently, a script (02_data_processing.py) copies a specified number of images into a train directory and renames them systematically to facilitate label assignment for each image.
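02_data_processing.py is not shown here; a hypothetical sketch of the copy-and-rename step, assuming a cat.N.jpg / dog.N.jpg naming convention so that labels can later be read back from filenames:

```python
import os
import shutil

def build_subset(src, dst, num_per_class):
    """Copy the first num_per_class cat and dog images from src into dst,
    renaming them cat.0.jpg, cat.1.jpg, ... / dog.0.jpg, dog.1.jpg, ...
    (hypothetical helper; the repository script may differ in detail)."""
    os.makedirs(dst, exist_ok=True)
    for cls in ("cat", "dog"):
        names = sorted(n for n in os.listdir(src) if n.startswith(cls))
        for i, name in enumerate(names[:num_per_class]):
            shutil.copy(os.path.join(src, name),
                        os.path.join(dst, f"{cls}.{i}.jpg"))
```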

Image Transformation Pipeline

The preprocessing pipeline includes:

  1. Cropping images to a fixed size (224x224).
  2. Converting images to tensors.
  3. Normalizing pixel values across RGB channels.
  4. Applying data augmentation techniques.
  5. Creating a DataLoader via PyTorch's Dataset class.

A custom dataset class Mydata, defined in dataset.py and inheriting from torch.utils.data.Dataset, implements three essential methods:

(1) Initialization

Loads image paths and splits data into training and validation sets:


import os
import torch
from PIL import Image
from torchvision import transforms as T

class Mydata(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None, train=True):
        # collect image paths, then take an 80/20 train/validation split
        imgs = [os.path.join(root, img) for img in os.listdir(root)]
        imgs_num = len(imgs)
        if train:
            self.imgs = imgs[:int(0.8 * imgs_num)]
        else:
            self.imgs = imgs[int(0.8 * imgs_num):]

        # the parameter name shadows torchvision.transforms,
        # so the module is imported as T above
        if transforms is None:
            normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])
            self.transforms = T.Compose([
                T.Resize((224, 224)),  # resize before cropping
                T.CenterCrop(224),
                T.ToTensor(),
                normalize
            ])
        else:
            self.transforms = transforms

(2) Get Item Method

Processes an image and returns its tensor and corresponding label:

    def __getitem__(self, index):
        img_path = self.imgs[index]
        # filenames follow the cat.N.jpg / dog.N.jpg convention
        label = 1 if "dog" in img_path.split("/")[-1] else 0
        data = Image.open(img_path).convert("RGB")
        return self.transforms(data), label

(3) Length Method

Returns the total count of images in the dataset:

    def __len__(self):
        return len(self.imgs)

(4) Testing

After instantiating the dataset, you can retrieve a processed image using __getitem__():

if __name__ == "__main__":
    root = "./data/train"
    train = Mydata(root, train=True)
    img, label = train.__getitem__(5)
    print(img.dtype)
    print(img.size(), label)
    print(len(train))

# Output:
torch.float32
torch.Size([3, 224, 224]) 0
3200
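Step 5 of the pipeline then wraps the dataset in a DataLoader for batched, shuffled iteration. A minimal sketch, using a dummy TensorDataset in place of Mydata so it runs without the image files on disk:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy stand-in for Mydata: 100 fake "images" with binary labels
dummy = TensorDataset(torch.randn(100, 3, 224, 224),
                      torch.randint(0, 2, (100,)))
loader = DataLoader(dummy, batch_size=32, shuffle=True)

imgs, labels = next(iter(loader))
print(imgs.shape)  # torch.Size([32, 3, 224, 224])
```

Swapping `dummy` for a Mydata instance yields the loaders used during training.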

Model Architecture

All models are defined in models.py, including LeNet, AlexNet, ResNet, and SqueezeNet. Here are key implementations:

LeNet Model

Designed originally for handwritten digit recognition, LeNet is adapted here with:

  1. Three convolutional layers.
  2. Three fully connected layers.
  3. ReLU activation functions.
  4. Batch normalization after convolutions.

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.relu = nn.ReLU()
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.fc1 = nn.Linear(3 * 3 * 64, 64)
        self.fc2 = nn.Linear(64, 10)
        self.out = nn.Linear(10, 2)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = x.view(x.shape[0], -1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.out(x)
        return x

AlexNet Model

Introduced in 2012, AlexNet features:

  1. Eight layers including five convolutional and three fully connected.
  2. ReLU activation functions.
  3. Dropout for regularization.
  4. Local Response Normalization (LRN).
  5. Overlapping max pooling.
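The AlexNet code from models.py is not reproduced in the article; a minimal sketch wiring together these five ingredients (channel sizes follow the 2012 paper and are assumptions here, with the final layer reduced to two classes):

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),           # LRN, as in the original paper
            nn.MaxPool2d(kernel_size=3, stride=2),  # overlapping pooling (3x3, stride 2)
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),  # dropout for regularization
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```

With a 224x224 input the feature maps come out at 256x6x6, which fixes the size of the first linear layer.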

Training Process

Training logic resides in main.py, which handles:

  1. Specifying model, epochs, and hyperparameters.
  2. Resuming training from checkpoints.
  3. Saving best and latest models.
  4. Evaluating performance metrics.
  5. Visualizing training progress with TensorBoard.
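The core of such a training loop, as a generic sketch rather than the exact contents of main.py:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One pass over the training loader; returns the mean batch loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss = 0.0
    for imgs, labels in loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
```

Checkpointing, validation, and TensorBoard logging wrap around a loop like this.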

Initiating Training

Execute the script:

python3 main.py

If interrupted, resume training by setting resume=True.

TensorBoard Visualization

Start TensorBoard:

tensorboard --logdir runs

Comparative Analysis of Model Performance

LeNet Experiments

No Augmentation, Small Dataset (1000 samples)

  • Accuracy stabilizes around 63% after ~30 epochs.
  • Validation loss increases despite decreasing training loss.
  • Indicates overfitting due to limited data.

Larger Dataset Without Augmentation (4000 samples)

  • Accuracy improves to ~68%, showing benefit of more data.

With Data Augmentation (4000 samples)

  • Horizontal flip (p=0.5), vertical flip (p=0.1).
  • Accuracy reaches ~71%.

Stronger Augmentation (4000 samples)

  • Horizontal flip (p=0.5), vertical flip (p=0.5), brightness adjustment.
  • Accuracy peaks at ~75% before slight decline.

Adding Dropout Regularization

  • Applied after first FC layer.
  • Maintains stable validation loss.
  • Achieves ~76% final accuracy without overfitting.
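Inserting dropout after the first FC layer can be sketched as follows (p=0.5 is an assumption; the repository may use a different rate):

```python
import torch
import torch.nn as nn

# hypothetical dropout variant of LeNet's first FC stage
fc1 = nn.Sequential(
    nn.Linear(3 * 3 * 64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes activations during training only
)

x = torch.randn(4, 3 * 3 * 64)
print(fc1(x).shape)  # torch.Size([4, 64])
```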

AlexNet

  • More parameters than LeNet.
  • Requires batch normalization, SGD optimizer, and careful learning rate tuning.
  • Final accuracy reaches ~78%.

SqueezeNet

  • Utilizes transfer learning.
  • After 16 epochs, achieves ~93% accuracy.

ResNet

  • Uses pre-trained ResNet50.
  • After 25 epochs, achieves ~98% accuracy.

Prediction

Once trained, predictions are made using predict.py:

import torch
from models import LeNet1  # the trained LeNet variant defined in models.py

model = LeNet1()
modelpath = "./runs/LeNet1_1/LeNet1_best.pth"
checkpoint = torch.load(modelpath)
model.load_state_dict(checkpoint)
model.eval()  # switch off dropout for inference
root = "test_pics"

Predicted images are saved in an output folder along with predicted classes and confidence scores.
