Fading Coder

One Final Commit for the Last Sprint

Core Concepts and Architectures in Deep Learning Systems

Deep Learning vs. Classical Machine Learning

Deep Learning (DL) is a branch of machine learning that utilizes neural networks with multiple layers to model complex patterns in data. While traditional machine learning often requires manual feature engineering, deep learning excels at automatically extracting hierarchical features from unstructured data like imagery and text.

| Aspect | Machine Learning | Deep Learning |
| --- | --- | --- |
| Feature Extraction | Manual / expert-driven | Automated via layers |
| Model Architecture | Shallow (trees, SVMs) | Deep (multilayered networks) |
| Data Scale | Effective on small/medium sets | Requires massive datasets |
| Performance | Plateaus on complex tasks | High performance on vision/NLP |

The Perceptron and Linear Separability

The Perceptron is the foundational unit of neural networks, acting as a linear classifier. It computes a weighted sum of inputs and applies a step function.

$$y = \varphi\Big(\sum_i w_i x_i + b\Big)$$

Limitations:

  • XOR Problem: A single perceptron cannot solve non-linearly separable problems.
  • Convergence: If data is not linearly separable, training will not converge.
  • Depth: Lacks hidden layers, restricting it to simple boundaries.
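
As a quick sketch (illustrative code, not from any particular library), a perceptron reduces to a dot product followed by a step function:

```python
import numpy as np

def perceptron(x, w, b):
    """Step-activated linear classifier: y = 1 if w.x + b > 0 else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# AND is linearly separable, so a single weight/bias pair suffices:
w, b = np.array([1.0, 1.0]), -1.5
outputs = [perceptron(np.array(p), w, b) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs == [0, 0, 0, 1]
```

No single (w, b) pair can reproduce XOR's outputs [0, 1, 1, 0], which is exactly the limitation described above.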

The Necessity of Non-Linearity

Neural networks use non-linear activation functions to approximate complex mappings. Without them, a multi-layer network would mathematically collapse into a single-layer linear transformation. The Universal Approximation Theorem states that a feedforward network with a single hidden layer and finite neurons can approximate any continuous function given appropriate non-linear activations.
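
The collapse is easy to verify numerically; a small sketch with random weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # "layer 1" weights
W2 = rng.standard_normal((2, 4))   # "layer 2" weights
x = rng.standard_normal(3)

# Two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with the merged weight W2 @ W1.
one_layer = (W2 @ W1) @ x
assert np.allclose(two_layers, one_layer)
```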

Common Activation Functions

| Activation | Mathematical Definition | Characteristics |
| --- | --- | --- |
| Sigmoid | $\sigma(x) = (1 + e^{-x})^{-1}$ | S-shaped, ranges (0,1), prone to vanishing gradients |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Zero-centered, ranges (-1,1), stronger gradients than Sigmoid |
| ReLU | $f(x) = \max(0, x)$ | Efficient computation, alleviates vanishing gradient, can "die" |
| Leaky ReLU | $f(x) = \max(\alpha x, x)$ | Fixes "dying ReLU" by allowing a small negative slope |
| GELU | $0.5x(1 + \text{erf}(x/\sqrt{2}))$ | Probability-based, used in Transformer architectures |
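
The table's definitions translate directly to code; this NumPy sketch (function names are mine) follows each formula:

```python
import numpy as np
from math import erf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def gelu(x):
    # 0.5 * x * (1 + erf(x / sqrt(2))); math.erf is scalar, so vectorize it
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

x = np.array([-2.0, 0.0, 2.0])
s, t, r = sigmoid(x), np.tanh(x), relu(x)
lrelu, g = leaky_relu(x), gelu(x)
```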

Optimization and Gradient Mechanics

  • Forward Propagation: Input data passes through the layers to produce a prediction and compute the loss.
  • Backward Propagation: Using the Chain Rule, gradients are computed from the loss back to each parameter.
  • Gradient Descent: Parameters are updated in the opposite direction of the gradient to minimize loss: $$\theta_{new} = \theta_{old} - \eta \nabla L(\theta)$$
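
A bare-bones illustration of the update rule on the toy loss $L(\theta) = (\theta - 3)^2$:

```python
# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, eta = 0.0, 0.1
for _ in range(100):
    grad = 2.0 * (theta - 3.0)
    theta -= eta * grad   # theta_new = theta_old - eta * grad

# theta converges toward the minimizer 3.0
```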

Handling Gradient Issues

  • Vanishing Gradients: Gradients become near-zero in early layers, stopping learning. Solved by ReLU, Batch Norm, or Residual links.
  • Exploding Gradients: Gradients grow exponentially, causing instability. Solved by Gradient Clipping or better initialization.
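
Clipping by norm is simple to sketch. PyTorch ships this as `torch.nn.utils.clip_grad_norm_`; the NumPy version below just shows the idea:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])       # exploding gradient, norm 50
clipped = clip_by_norm(g, 5.0)   # rescaled to norm 5, direction preserved
```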

Regularization and Generalization

  • L1 (Lasso): Adds absolute value of weights to loss, encouraging sparsity.
  • L2 (Ridge): Adds squared weights to loss, preventing large weights and overfitting.
  • Dropout: Randomly deactivates neurons during training to reduce co-dependency.
  • Batch Normalization: Rescales layer inputs to have zero mean and unit variance, stabilizing training.
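
Inverted dropout, the variant most frameworks use, can be sketched as follows; scaling survivors by 1/(1-p) keeps the expected activation unchanged, so no rescaling is needed at inference:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, scale survivors."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
y = dropout(x, p=0.5, rng=np.random.default_rng(0))
# Roughly half the units become 0, the rest 2.0, so the mean stays near 1
```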

Convolutional Neural Networks (CNN)

CNNs leverage spatial hierarchies through kernels (filters).

  • Padding: Adding borders to input to maintain spatial dimensions.
  • Stride: The step size of the kernel movement. Larger strides reduce output resolution.
  • Pooling: Downsampling (Max or Average) to reduce computation and gain translation invariance.
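
These knobs determine the output size via the standard formula out = ⌊(n + 2p − k)/s⌋ + 1, which is easy to check:

```python
def conv_out_size(n, k, p=0, s=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# A 3x3 kernel with padding 1 preserves a 32x32 input ("same" padding)
assert conv_out_size(32, 3, p=1, s=1) == 32
# Stride 2 halves the resolution
assert conv_out_size(32, 3, p=1, s=2) == 16
# No padding shrinks the map: 28 -> 24 with a 5x5 kernel
assert conv_out_size(28, 5) == 24
```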

Residual Learning (ResNet)

ResNet introduces shortcut connections that skip one or more layers. This allows gradients to flow directly through the identity mapping, enabling the training of networks with hundreds of layers.

import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Two 3x3 convolutions that preserve spatial size and channel count
        self.pipeline = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim)
        )

    def forward(self, x):
        shortcut = x                  # identity mapping
        res = self.pipeline(x)
        return torch.relu(res + shortcut)  # gradient flows through the shortcut

Recurrent Structures: RNN, LSTM, and GRU

Standard RNNs struggle with long-term memory due to vanishing gradients.

  • LSTM (Long Short-Term Memory): Uses Forget, Input, and Output gates to manage a persistent Cell State.
  • GRU (Gated Recurrent Unit): A streamlined version of LSTM combining gates into Reset and Update gates.

class SimpleGRUCell(nn.Module):
    def __init__(self, in_sz, hid_sz):
        super().__init__()
        self.hid_sz = hid_sz
        # Update (z) and reset (r) gates, computed jointly from [x, h_prev]
        self.gate_w = nn.Linear(in_sz + hid_sz, 2 * hid_sz)
        # Candidate hidden state h_tilde
        self.cand_w = nn.Linear(in_sz + hid_sz, hid_sz)

    def forward(self, x, h_prev):
        combined = torch.cat([x, h_prev], dim=1)
        gates = torch.sigmoid(self.gate_w(combined))
        z, r = gates.chunk(2, 1)

        # Reset gate controls how much past state feeds the candidate
        combined_r = torch.cat([x, r * h_prev], dim=1)
        h_tilde = torch.tanh(self.cand_w(combined_r))

        # Update gate interpolates between previous state and candidate
        h_next = (1 - z) * h_prev + z * h_tilde
        return h_next

Transformer Architecture

The Transformer relies on Self-Attention to weigh the importance of different tokens in a sequence regardless of distance.

  1. Multi-Head Attention: Multiple attention heads learn different relationships.
  2. LayerNorm & Residuals: Applied after every sub-block.
  3. Feed-Forward Networks: Position-wise MLP applied to each token.
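
Component 1 boils down to scaled dot-product attention, softmax(QKᵀ/√d_k)·V; a single-head NumPy sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (tokens, tokens)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = attention(Q, K, V)   # (5, 8): each token mixes values from all tokens
```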

BERT and GPT Paradigms

  • BERT (Encoder-only): Uses Masked Language Modeling (MLM) to learn bidirectional context. Ideal for classification and NER.
  • GPT (Decoder-only): Uses Causal Language Modeling (predicting next token) for generative tasks.

Data Imbalance Strategies

  1. Resampling: Over-sampling minority classes or under-sampling majority classes.
  2. Cost-Sensitive Learning: Assigning higher loss weights to minority samples.
  3. Focal Loss: Down-weighting easy examples to focus on hard, minority samples.
  4. Metrics: Using F1-Score, Precision-Recall curves, or AUC instead of Accuracy.
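
Strategy 3 has a compact closed form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t); a binary sketch:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, confident positive contributes almost nothing...
easy = focal_loss(np.array([0.95]), np.array([1]))
# ...while a hard, misclassified positive dominates the loss.
hard = focal_loss(np.array([0.10]), np.array([1]))
```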

Evaluation Metrics

| Metric | Calculation | Best Use Case |
| --- | --- | --- |
| Precision | TP / (TP + FP) | Minimize false positives (e.g., spam detection) |
| Recall | TP / (TP + FN) | Minimize false negatives (e.g., cancer detection) |
| F1-Score | Harmonic mean of P & R | Balanced evaluation on imbalanced data |
| IoU | Area of Overlap / Area of Union | Object detection and segmentation |
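
From raw confusion-matrix counts, the first three metrics are one-liners:

```python
def prf1(tp, fp, fn):
    """Precision, Recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = prf1(8, 2, 4)   # precision 0.8, recall 2/3, F1 = 8/11
```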
