Core Concepts and Architectures in Deep Learning Systems
Deep Learning vs. Classical Machine Learning
Deep Learning (DL) is a branch of machine learning that utilizes neural networks with multiple layers to model complex patterns in data. While traditional machine learning often requires manual feature engineering, deep learning excels at automatically extracting hierarchical features from unstructured data like imagery and text.
| Feature | Machine Learning | Deep Learning |
|---|---|---|
| Feature Extraction | Manual/Expert-driven | Automated via layers |
| Model Architecture | Shallow (Trees, SVMs) | Deep (Multilayered Networks) |
| Data Scale | Effective on small/medium sets | Requires massive datasets |
| Performance | Plateaus on complex tasks | High performance on vision/NLP |
The Perceptron and Linear Separability
The Perceptron is the foundational unit of neural networks, acting as a linear classifier. It computes a weighted sum of inputs and applies a step function.
$$y = \varphi\left(\sum_i w_i x_i + b\right)$$
Limitations:
- XOR Problem: A single perceptron cannot solve non-linearly separable problems.
- Convergence: If data is not linearly separable, training will not converge.
- Depth: Lacks hidden layers, restricting it to simple boundaries.
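The XOR limitation is easy to demonstrate. Below is a minimal sketch of the classic perceptron learning rule in NumPy (the learning rate and epoch count are illustrative choices): it learns AND, which is linearly separable, but no setting of the weights can ever reproduce XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Classic perceptron rule: w += lr * (target - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_y = np.array([0, 0, 0, 1])   # linearly separable
xor_y = np.array([0, 1, 1, 0])   # not linearly separable

w, b = train_perceptron(X, and_y)
print([1 if x @ w + b > 0 else 0 for x in X])  # → [0, 0, 0, 1], AND is learned

w2, b2 = train_perceptron(X, xor_y)
xor_preds = [1 if x @ w2 + b2 > 0 else 0 for x in X]
print(xor_preds)  # never equals [0, 1, 1, 0]: no linear boundary separates XOR
```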
The Necessity of Non-Linearity
Neural networks use non-linear activation functions to approximate complex mappings. Without them, a multi-layer network would mathematically collapse into a single-layer linear transformation. The Universal Approximation Theorem states that a feedforward network with a single hidden layer and finite neurons can approximate any continuous function given appropriate non-linear activations.
Common Activation Functions
| Activation | Mathematical Definition | Characteristics |
|---|---|---|
| Sigmoid | $\sigma(x) = (1 + e^{-x})^{-1}$ | S-shaped, ranges (0,1), prone to vanishing gradients |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Zero-centered, ranges (-1,1), stronger gradients than Sigmoid |
| ReLU | $f(x) = \max(0, x)$ | Efficient computation, alleviates vanishing gradient, can "die" |
| Leaky ReLU | $f(x) = \max(\alpha x, x)$ | Fixes "dying ReLU" by allowing small negative slopes |
| GELU | $0.5x(1 + \text{erf}(x/\sqrt{2}))$ | Probability-based, used in Transformer architectures |
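The ranges and shapes claimed in the table can be checked directly with PyTorch's built-in implementations (the sample points are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])
print(torch.sigmoid(x))       # values in (0, 1); sigmoid(0) = 0.5
print(torch.tanh(x))          # zero-centered, values in (-1, 1)
print(F.relu(x))              # negatives clamped to 0
print(F.leaky_relu(x, 0.01))  # small slope (alpha = 0.01) for negatives
print(F.gelu(x))              # smooth, probability-weighted gating
```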
Optimization and Gradient Mechanics
- Forward Propagation: Input data passes through the layers to produce a prediction, from which the loss is computed.
- Backward Propagation: Using the Chain Rule, gradients of the loss are computed with respect to each parameter, working backward from the output layer.
- Gradient Descent: Parameters are updated in the direction opposite the gradient to minimize loss: $$\theta_{new} = \theta_{old} - \eta \nabla L(\theta)$$
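This loop can be sketched with autograd on a toy one-parameter loss (the quadratic objective and learning rate are illustrative choices):

```python
import torch

# Toy loss L(theta) = (theta - 3)^2, whose minimum is at theta = 3.
theta = torch.tensor(0.0, requires_grad=True)
lr = 0.1
for _ in range(50):
    loss = (theta - 3.0) ** 2
    loss.backward()               # backward pass: computes dL/dtheta
    with torch.no_grad():
        theta -= lr * theta.grad  # theta_new = theta_old - eta * grad
        theta.grad.zero_()        # clear the gradient for the next step
print(theta.item())  # converges toward 3.0
```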
Handling Gradient Issues
- Vanishing Gradients: Gradients become near-zero in early layers, stopping learning. Solved by ReLU, Batch Norm, or Residual links.
- Exploding Gradients: Gradients grow exponentially, causing instability. Solved by Gradient Clipping or better initialization.
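Gradient clipping is a one-liner in PyTorch via `torch.nn.utils.clip_grad_norm_`, which rescales the global gradient norm before the optimizer step. A minimal sketch with a toy linear model (shapes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale the global gradient norm to at most 1.0; returns the pre-clip norm.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
clipped = torch.cat([p.grad.flatten() for p in model.parameters()]).norm()
print(float(clipped))  # now at most 1.0 (up to floating-point error)
```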
Regularization and Generalization
- L1 (Lasso): Adds absolute value of weights to loss, encouraging sparsity.
- L2 (Ridge): Adds squared weights to loss, preventing large weights and overfitting.
- Dropout: Randomly deactivates neurons during training to reduce co-dependency.
- Batch Normalization: Rescales layer inputs to have zero mean and unit variance, stabilizing training.
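In PyTorch, L2 regularization is usually supplied through the optimizer's `weight_decay` argument, and dropout is only active in training mode. A small sketch (layer sizes and the decay coefficient are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),           # randomly zeroes units during training
    nn.Linear(64, 2),
)
# weight_decay adds the L2 penalty term to every parameter update.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 20)
model.train()
out_train = model(x)   # dropout active: stochastic output
model.eval()
out_eval = model(x)    # dropout disabled: deterministic output
print(torch.equal(model(x), out_eval))  # True in eval mode
```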
Convolutional Neural Networks (CNN)
CNNs exploit spatial structure by sliding learnable kernels (filters) over the input, building hierarchies of local features.
- Padding: Adding borders to input to maintain spatial dimensions.
- Stride: The step size of the kernel movement. Larger strides reduce output resolution.
- Pooling: Downsampling (Max or Average) to reduce computation and gain translation invariance.
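The output resolution follows $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, padding $p$, kernel size $k$, and stride $s$; a quick shape check (channel counts and sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # (batch, channels, H, W)

same = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=1)
print(same(x).shape)     # [1, 16, 32, 32]: padding=1 preserves H and W

strided = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)
print(strided(x).shape)  # (32 + 2*1 - 3)//2 + 1 = 16 -> [1, 16, 16, 16]

pool = nn.MaxPool2d(2)   # 2x2 max pooling halves each spatial dimension
print(pool(same(x)).shape)  # [1, 16, 16, 16]
```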
Residual Learning (ResNet)
ResNet introduces shortcut connections that skip one or more layers. This allows gradients to flow directly through the identity mapping, enabling the training of networks with hundreds of layers.
```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.pipeline = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        shortcut = x                       # identity mapping
        res = self.pipeline(x)             # residual branch F(x)
        return torch.relu(res + shortcut)  # output is F(x) + x
```
Recurrent Structures: RNN, LSTM, and GRU
Standard RNNs struggle with long-term memory due to vanishing gradients.
- LSTM (Long Short-Term Memory): Uses Forget, Input, and Output gates to manage a persistent Cell State.
- GRU (Gated Recurrent Unit): A streamlined variant of the LSTM that merges its gating into Reset and Update gates and drops the separate cell state.
```python
class SimpleGRUCell(nn.Module):
    def __init__(self, in_sz, hid_sz):
        super().__init__()
        self.hid_sz = hid_sz
        self.gate_w = nn.Linear(in_sz + hid_sz, 2 * hid_sz)  # update (z) and reset (r) gates
        self.cand_w = nn.Linear(in_sz + hid_sz, hid_sz)      # candidate hidden state

    def forward(self, x, h_prev):
        combined = torch.cat([x, h_prev], dim=1)
        gates = torch.sigmoid(self.gate_w(combined))
        z, r = gates.chunk(2, 1)
        combined_r = torch.cat([x, r * h_prev], dim=1)  # reset gate scales the old state
        h_tilde = torch.tanh(self.cand_w(combined_r))
        h_next = (1 - z) * h_prev + z * h_tilde         # interpolate old and candidate states
        return h_next
```
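For comparison, PyTorch ships an equivalent built-in, `nn.GRUCell`, which can be unrolled over a sequence the same way (all sizes here are illustrative):

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=10, hidden_size=20)
h = torch.zeros(4, 20)        # initial hidden state for a batch of 4
seq = torch.randn(5, 4, 10)   # (time, batch, features)
for x_t in seq:               # unroll the recurrence over time steps
    h = cell(x_t, h)
print(h.shape)  # torch.Size([4, 20])
```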
Transformer Architecture
The Transformer relies on Self-Attention to weigh the importance of different tokens in a sequence regardless of distance.
- Multi-Head Attention: Multiple attention heads learn different relationships.
- LayerNorm & Residuals: Residual connections wrap each sub-layer, followed by layer normalization (post-LN in the original design; many modern variants normalize before the sub-layer instead).
- Feed-Forward Networks: Position-wise MLP applied to each token.
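Single-head scaled dot-product attention, the core of the mechanism above, can be sketched in a few lines (random matrices stand in for learned projection weights):

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(4, 8)                        # 4 tokens, embedding dim 8
w = [torch.randn(8, 8) for _ in range(3)]    # stand-ins for W_Q, W_K, W_V
out, attn = self_attention(x, *w)
print(out.shape, attn.sum(dim=-1))  # torch.Size([4, 8]); attention rows sum to 1
```

Multi-head attention runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.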
BERT and GPT Paradigms
- BERT (Encoder-only): Uses Masked Language Modeling (MLM) to learn bidirectional context. Ideal for classification and NER.
- GPT (Decoder-only): Uses Causal Language Modeling (predicting next token) for generative tasks.
Data Imbalance Strategies
- Resampling: Over-sampling minority classes or under-sampling majority classes.
- Cost-Sensitive Learning: Assigning higher loss weights to minority samples.
- Focal Loss: Down-weighting easy examples to focus on hard, minority samples.
- Metrics: Using F1-Score, Precision-Recall curves, or AUC instead of Accuracy.
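A common binary formulation of focal loss (with the usual $\gamma = 2$) can be sketched as follows; the sample logits are made up so that the first two examples are "easy" and the third is "hard":

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: (1 - p_t)^gamma * cross-entropy, down-weighting easy examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model's probability for the true class
    return ((1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([4.0, -4.0, 0.1])   # confident, confident, uncertain
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets).item())                            # dominated by the hard example
print(F.binary_cross_entropy_with_logits(logits, targets).item())    # larger: easy examples not down-weighted
```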
Evaluation Metrics
| Metric | Calculation | Best Use Case |
|---|---|---|
| Precision | TP / (TP + FP) | Minimize False Positives (e.g., spam detection) |
| Recall | TP / (TP + FN) | Minimize False Negatives (e.g., cancer detection) |
| F1-Score | Harmonic Mean of P & R | Balanced evaluation for imbalanced data |
| IoU | Area of Overlap / Area of Union | Object detection and segmentation |
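The first three rows of the table can be checked with a small helper (the confusion-matrix counts are made up for illustration):

```python
def prf_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = prf_metrics(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.8 0.667 0.727
```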