Core Concepts and Architectures in Deep Learning Systems
Deep Learning vs. Classical Machine Learning
Deep Learning (DL) is a branch of machine learning that utilizes neural networks with multiple layers to model complex patterns in data. While traditional machine learning often requires manual feature engineering, deep learning excels at automatically extracting hierarchical features from unstructured data like imagery and text.
| Feature | Machine Learning | Deep Learning |
|---|---|---|
| Feature Extraction | Manual/Expert-driven | Automated via layers |
| Model Architecture | Shallow (Trees, SVMs) | Deep (Multilayered Networks) |
| Data Scale | Effective on small/medium sets | Requires massive datasets |
| Performance | Plateaus on complex tasks | High performance on vision/NLP |
The Perceptron and Linear Separability
The Perceptron is the foundational unit of neural networks, acting as a linear classifier. It computes a weighted sum of inputs and applies a step function.
$$y = \varphi\left(\sum_i w_i x_i + b\right)$$
Limitations:
- XOR Problem: A single perceptron cannot solve non-linearly separable problems.
- Convergence: If data is not linearly separable, training will not converge.
- Depth: Lacks hidden layers, restricting it to simple boundaries.
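The XOR limitation is easy to demonstrate. Below is a minimal sketch of the classic perceptron learning rule in NumPy (the learning rate and epoch count are illustrative choices): it learns AND, which is linearly separable, but no setting of the weights can ever reproduce XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Classic perceptron rule: w += lr * (target - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_y = np.array([0, 0, 0, 1])   # linearly separable
xor_y = np.array([0, 1, 1, 0])   # not linearly separable

w, b = train_perceptron(X, and_y)
print([1 if x @ w + b > 0 else 0 for x in X])  # → [0, 0, 0, 1], AND is learned

w2, b2 = train_perceptron(X, xor_y)
xor_preds = [1 if x @ w2 + b2 > 0 else 0 for x in X]
print(xor_preds)  # never equals [0, 1, 1, 0]: no linear boundary separates XOR
```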
The Necessity of Non-Linearity
Neural networks use non-linear activation functions to approximate complex mappings. Without them, a multi-layer network would mathematically collapse into a single-layer linear transformation. The Universal Approximation Theorem states that a feedforward network with a single hidden layer and finite neurons can approximate any continuous function given appropriate non-linear activations.
Common Activation Functions
| Activation | Mathematical Definition | Characteristics |
|---|---|---|
| Sigmoid | $\sigma(x) = (1 + e^{-x})^{-1}$ | S-shaped, ranges (0,1), prone to vanishing gradients |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Zero-centered, ranges (-1,1), stronger gradients than Sigmoid |
| ReLU | $f(x) = \max(0, x)$ | Efficient computation, alleviates vanishing gradient, can "die" |
| Leaky ReLU | $f(x) = \max(\alpha x, x)$ | Fixes "dying ReLU" by allowing small negative slopes |
| GELU | $0.5x(1 + \text{erf}(x/\sqrt{2}))$ | Probability-based, used in Transformer architectures |
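The ranges and shapes claimed in the table can be checked directly with PyTorch's built-in implementations (the sample points are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])
print(torch.sigmoid(x))       # values in (0, 1); sigmoid(0) = 0.5
print(torch.tanh(x))          # zero-centered, values in (-1, 1)
print(F.relu(x))              # negatives clamped to 0
print(F.leaky_relu(x, 0.01))  # small slope (alpha = 0.01) for negatives
print(F.gelu(x))              # smooth, probability-weighted gating
```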
Optimization and Gradient Mechanics
- Forward Propagation: Input data passes through the layers to produce a prediction, from which the loss is computed.
- Backward Propagation: Using the Chain Rule, gradients of the loss are computed with respect to each parameter, working backward from the output layer.
- Gradient Descent: Parameters are updated in the direction opposite the gradient to minimize loss: $$\theta_{new} = \theta_{old} - \eta \nabla L(\theta)$$
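This loop can be sketched with autograd on a toy one-parameter loss (the quadratic objective and learning rate are illustrative choices):

```python
import torch

# Toy loss L(theta) = (theta - 3)^2, whose minimum is at theta = 3.
theta = torch.tensor(0.0, requires_grad=True)
lr = 0.1
for _ in range(50):
    loss = (theta - 3.0) ** 2
    loss.backward()               # backward pass: computes dL/dtheta
    with torch.no_grad():
        theta -= lr * theta.grad  # theta_new = theta_old - eta * grad
        theta.grad.zero_()        # clear the gradient for the next step
print(theta.item())  # converges toward 3.0
```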
Handling Gradient Issues
- Vanishing Gradients: Gradients become near-zero in early layers, stopping learning. Solved by ReLU, Batch Norm, or Residual links.
- Exploding Gradients: Gradients grow exponentially, causing instability. Solved by Gradient Clipping or better initialization.
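Gradient clipping is a one-liner in PyTorch via `torch.nn.utils.clip_grad_norm_`, which rescales the global gradient norm before the optimizer step. A minimal sketch with a toy linear model (shapes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale the global gradient norm to at most 1.0; returns the pre-clip norm.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
clipped = torch.cat([p.grad.flatten() for p in model.parameters()]).norm()
print(float(clipped))  # now at most 1.0 (up to floating-point error)
```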
Regularization and Generalization
- L1 (Lasso): Adds absolute value of weights to loss, encouraging sparsity.
- L2 (Ridge): Adds squared weights to loss, preventing large weights and overfitting.
- Dropout: Randomly deactivates neurons during training to reduce co-dependency.
- Batch Normalization: Rescales layer inputs to have zero mean and unit variance, stabilizing training.
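In PyTorch, L2 regularization is usually supplied through the optimizer's `weight_decay` argument, and dropout is only active in training mode. A small sketch (layer sizes and the decay coefficient are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),           # randomly zeroes units during training
    nn.Linear(64, 2),
)
# weight_decay adds the L2 penalty term to every parameter update.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 20)
model.train()
out_train = model(x)   # dropout active: stochastic output
model.eval()
out_eval = model(x)    # dropout disabled: deterministic output
print(torch.equal(model(x), out_eval))  # True in eval mode
```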
Convolutional Neural Networks (CNN)
CNNs exploit spatial structure by sliding learnable kernels (filters) over the input, building hierarchies of local features.
- Padding: Adding borders to input to maintain spatial dimensions.
- Stride: The step size of the kernel movement. Larger strides reduce output resolution.
- Pooling: Downsampling (Max or Average) to reduce computation and gain translation invariance.
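The output resolution follows $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, padding $p$, kernel size $k$, and stride $s$; a quick shape check (channel counts and sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # (batch, channels, H, W)

same = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=1)
print(same(x).shape)     # [1, 16, 32, 32]: padding=1 preserves H and W

strided = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)
print(strided(x).shape)  # (32 + 2*1 - 3)//2 + 1 = 16 -> [1, 16, 16, 16]

pool = nn.MaxPool2d(2)   # 2x2 max pooling halves each spatial dimension
print(pool(same(x)).shape)  # [1, 16, 16, 16]
```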
Residual Learning (ResNet)
ResNet introduces shortcut connections that skip one or more layers. This allows gradients to flow directly through the identity mapping, enabling the training of networks with hundreds of layers.
```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.pipeline = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        shortcut = x                       # identity mapping
        res = self.pipeline(x)             # residual branch F(x)
        return torch.relu(res + shortcut)  # output is F(x) + x
```
Recurrent Structures: RNN, LSTM, and GRU
Standard RNNs struggle with long-term memory due to vanishing gradients.
- LSTM (Long Short-Term Memory): Uses Forget, Input, and Output gates to manage a persistent Cell State.
- GRU (Gated Recurrent Unit): A streamlined variant of the LSTM that merges its gating into Reset and Update gates and drops the separate cell state.
```python
class SimpleGRUCell(nn.Module):
    def __init__(self, in_sz, hid_sz):
        super().__init__()
        self.hid_sz = hid_sz
        self.gate_w = nn.Linear(in_sz + hid_sz, 2 * hid_sz)  # update (z) and reset (r) gates
        self.cand_w = nn.Linear(in_sz + hid_sz, hid_sz)      # candidate hidden state

    def forward(self, x, h_prev):
        combined = torch.cat([x, h_prev], dim=1)
        gates = torch.sigmoid(self.gate_w(combined))
        z, r = gates.chunk(2, 1)
        combined_r = torch.cat([x, r * h_prev], dim=1)  # reset gate scales the old state
        h_tilde = torch.tanh(self.cand_w(combined_r))
        h_next = (1 - z) * h_prev + z * h_tilde         # interpolate old and candidate states
        return h_next
```
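For comparison, PyTorch ships an equivalent built-in, `nn.GRUCell`, which can be unrolled over a sequence the same way (all sizes here are illustrative):

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=10, hidden_size=20)
h = torch.zeros(4, 20)        # initial hidden state for a batch of 4
seq = torch.randn(5, 4, 10)   # (time, batch, features)
for x_t in seq:               # unroll the recurrence over time steps
    h = cell(x_t, h)
print(h.shape)  # torch.Size([4, 20])
```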
Transformer Architecture
The Transformer relies on Self-Attention to weigh the importance of different tokens in a sequence regardless of distance.
- Multi-Head Attention: Multiple attention heads learn different relationships.
- LayerNorm & Residuals: Residual connections wrap each sub-layer, followed by layer normalization (post-LN in the original design; many modern variants normalize before the sub-layer instead).
- Feed-Forward Networks: Position-wise MLP applied to each token.
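Single-head scaled dot-product attention, the core of the mechanism above, can be sketched in a few lines (random matrices stand in for learned projection weights):

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(4, 8)                        # 4 tokens, embedding dim 8
w = [torch.randn(8, 8) for _ in range(3)]    # stand-ins for W_Q, W_K, W_V
out, attn = self_attention(x, *w)
print(out.shape, attn.sum(dim=-1))  # torch.Size([4, 8]); attention rows sum to 1
```

Multi-head attention runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.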
BERT and GPT Paradigms
- BERT (Encoder-only): Uses Masked Language Modeling (MLM) to learn bidirectional context. Ideal for classification and NER.
- GPT (Decoder-only): Uses Causal Language Modeling (predicting next token) for generative tasks.
Data Imbalance Strategies
- Resampling: Over-sampling minority classes or under-sampling majority classes.
- Cost-Sensitive Learning: Assigning higher loss weights to minority samples.
- Focal Loss: Down-weighting easy examples to focus on hard, minority samples.
- Metrics: Using F1-Score, Precision-Recall curves, or AUC instead of Accuracy.
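A common binary formulation of focal loss (with the usual $\gamma = 2$) can be sketched as follows; the sample logits are made up so that the first two examples are "easy" and the third is "hard":

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: (1 - p_t)^gamma * cross-entropy, down-weighting easy examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model's probability for the true class
    return ((1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([4.0, -4.0, 0.1])   # confident, confident, uncertain
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets).item())                            # dominated by the hard example
print(F.binary_cross_entropy_with_logits(logits, targets).item())    # larger: easy examples not down-weighted
```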
Evaluation Metrics
| Metric | Calculation | Best Use Case |
|---|---|---|
| Precision | TP / (TP + FP) | Minimize False Positives (e.g., spam detection) |
| Recall | TP / (TP + FN) | Minimize False Negatives (e.g., cancer detection) |
| F1-Score | Harmonic Mean of P & R | Balanced evaluation for imbalanced data |
| IoU | Area of Overlap / Area of Union | Object detection and segmentation |
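The first three rows of the table can be checked with a small helper (the confusion-matrix counts are made up for illustration):

```python
def prf_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = prf_metrics(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.8 0.667 0.727
```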