Foundations of Neural Networks and Deep Learning
Perceptrons and Logical Operations
A perceptron is a binary classifier that takes multiple inputs and produces a single output. Each input is weighted, and the output is determined by whether the weighted sum exceeds a threshold — yielding 1 (fire) if true, 0 (no fire) otherwise.
Basic Logic Gates
- AND gate: Outputs 1 only when both inputs are 1.
- NAND gate: Inverts the AND output.
- OR gate: Outputs 1 if at least one input is 1.
Implementation with Weights and Bias
The bias term b shifts the decision boundary and plays the role of a negated threshold (the unit fires when w·x + b > 0), while the weights w₁, w₂ scale each input's contribution. The three gates can be implemented as:
import numpy as np

def and_gate(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.7
    z = np.dot(x, w) + b
    return 1 if z > 0 else 0

def nand_gate(x1, x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    z = np.dot(x, w) + b
    return 1 if z > 0 else 0

def or_gate(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2
    z = np.dot(x, w) + b
    return 1 if z > 0 else 0
Limitation: XOR and Linear Separability
A single-layer perceptron cannot compute XOR, because XOR is not linearly separable — no straight line can separate its truth table outputs. This limitation motivates multi-layer architectures.
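To illustrate why stacking layers helps, XOR can be written as AND(NAND(x1, x2), OR(x1, x2)). The following is a minimal sketch that reuses the weights and biases of the gates above; the helper `perceptron` and the name `xor_gate` are introduced here for illustration only.

```python
import numpy as np

def perceptron(x1, x2, w, b):
    """Return 1 when the weighted sum plus bias is positive, else 0."""
    z = np.dot(np.array([x1, x2]), np.array(w)) + b
    return 1 if z > 0 else 0

def xor_gate(x1, x2):
    # First layer: NAND and OR of the raw inputs.
    s1 = perceptron(x1, x2, w=[-0.5, -0.5], b=0.7)   # NAND
    s2 = perceptron(x1, x2, w=[0.5, 0.5], b=-0.2)    # OR
    # Second layer: AND of the two intermediate signals.
    return perceptron(s1, s2, w=[0.5, 0.5], b=-0.7)  # AND

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_gate(a, b))
```

Each individual unit is still a linear classifier; the non-linearly-separable behavior comes only from composing them in two layers.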
Multi-Layer Networks and Activation Functions
Replacing the step function with smooth, differentiable functions transforms a perceptron into a neural network capable of gradient-based learning.
Common Activation Functions
- Step function: Discontinuous, non-differentiable; used in classical perceptrons.
- Sigmoid: Smooth S-shaped curve, bounded between 0 and 1.
- ReLU (Rectified Linear Unit): f(x) = max(0, x); efficient and avoids vanishing gradients for positive inputs.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(x):
    return np.maximum(0, x)

def step(x):
    return (x > 0).astype(int)

x_vals = np.linspace(-4, 4, 1000)
plt.figure(figsize=(8, 4))
plt.plot(x_vals, step(x_vals), label='Step', linestyle='--')
plt.plot(x_vals, sigmoid(x_vals), label='Sigmoid', linestyle='-.')
plt.plot(x_vals, relu(x_vals), label='ReLU')
plt.legend()
plt.grid(True)
plt.show()
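The claim about vanishing gradients can be made concrete by comparing derivatives. The sketch below assumes the standard closed forms: the sigmoid derivative is s(z)(1 - s(z)), and the ReLU derivative is 1 for positive inputs and 0 otherwise. The names `sigmoid_grad` and `relu_grad` are illustrative and do not appear elsewhere in this text.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); its maximum is 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1 - s)

def relu_grad(x):
    # 1 where the input is positive, 0 elsewhere.
    return (x > 0).astype(float)

z = np.array([-6.0, 0.0, 6.0])
print(sigmoid_grad(z))  # small at both tails, 0.25 in the middle
print(relu_grad(z))     # 0 for non-positive inputs, 1 for positive inputs
```

Because the sigmoid derivative never exceeds 0.25, multiplying many such factors across layers shrinks the gradient, whereas a ReLU unit passes the gradient through unchanged whenever its input is positive.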
Matrix Operations in Neural Networks
Neural layers perform affine transformations: z = x @ W + b, where x is the input, W is the weight matrix, and b is the bias vector.
- For a 2D input x of shape (N, D_in) and weight W of shape (D_in, D_out), the output shape is (N, D_out).
- Broadcasting and np.dot() handle batched computation efficiently.
X = np.array([[1.0, 0.5]])           # shape: (1, 2)
W1 = np.array([[0.1, 0.3, 0.5],
               [0.2, 0.4, 0.6]])     # shape: (2, 3)
b1 = np.array([0.1, 0.2, 0.3])       # shape: (3,)
A1 = np.dot(X, W1) + b1              # affine output (pre-activation), shape: (1, 3)
Z1 = sigmoid(A1)                     # activation, using sigmoid() defined earlier; shape: (1, 3)
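The same expression works unchanged for a batch of inputs. As a small sketch, with the batch size N = 4 and the random values chosen purely for illustration, the bias of shape (D_out,) is broadcast across all N rows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, D_out = 4, 2, 3                 # illustrative sizes
X = rng.normal(size=(N, D_in))           # batch of N inputs
W = rng.normal(size=(D_in, D_out))
b = np.array([0.1, 0.2, 0.3])

Z = X @ W + b                            # b is broadcast over the N rows
print(Z.shape)                           # (4, 3)

# Computing each row separately gives the same result as the batched form.
Z_rows = np.stack([X[i] @ W + b for i in range(N)])
print(np.allclose(Z, Z_rows))            # True
```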
Building a Three-Layer Feedforward Network
import numpy as np

def init_params():
    return {
        'W1': np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]]),
        'b1': np.array([0.1, 0.2, 0.3]),
        'W2': np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]),
        'b2': np.array([0.1, 0.2]),
        'W3': np.array([[0.1, 0.3], [0.2, 0.4]]),
        'b3': np.array([0.1, 0.2])
    }

def forward(params, x):
    a1 = np.dot(x, params['W1']) + params['b1']
    z1 = sigmoid(a1)
    a2 = np.dot(z1, params['W2']) + params['b2']
    z2 = sigmoid(a2)
    a3 = np.dot(z2, params['W3']) + params['b3']
    return a3  # identity output

params = init_params()
x_input = np.array([[1.0, 0.5]])
y_output = forward(params, x_input)
print(y_output)  # [[0.31682708 0.69627909]]
Output Layers: Regression vs Classification
- Regression: Use identity activation (y = x); the output is a continuous value.
- Classification: Use softmax to convert logits into probability-like outputs summing to 1.
def softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logits = np.array([0.3, 2.9, 4.0])
probs = softmax(logits)
print(probs)          # [0.01821127 0.24519181 0.73659691]
print(np.sum(probs))  # 1.0
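Subtracting the maximum logit before exponentiating does not change the result mathematically, but it prevents overflow. The following sketch contrasts a naive version with the shifted one; `softmax_naive` and the large logits are illustrative and not part of the original code.

```python
import numpy as np

def softmax_naive(logits):
    # Direct definition: overflows once a logit exceeds roughly 709 in float64.
    exps = np.exp(logits)
    return exps / np.sum(exps)

def softmax_stable(logits):
    # Shift so the largest exponent is exp(0) = 1.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

big = np.array([1010.0, 1000.0, 990.0])
with np.errstate(over='ignore', invalid='ignore'):
    print(softmax_naive(big))   # all nan: inf / inf
print(softmax_stable(big))      # finite probabilities that sum to 1
```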
Loss Functions and Optimization
Loss Computation
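- Mean Squared Error (MSE) for regression (the factor 0.5 is a convention that simplifies the gradient):
def mse_loss(y_pred, y_true):
    return 0.5 * np.mean((y_pred - y_true) ** 2)
- Categorical Cross-Entropy for classification:
def cross_entropy_loss(y_pred, y_true):
    # y_true: one-hot encoded, shape (N, C), or (C,) for a single sample
    if y_pred.ndim == 1:
        y_pred = y_pred.reshape(1, -1)
        y_true = y_true.reshape(1, -1)
    eps = 1e-7  # avoids log(0)
    # Average over the batch dimension, not over the number of classes
    return -np.sum(y_true * np.log(y_pred + eps)) / y_pred.shape[0]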
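As a quick sanity check, both losses should be smaller for predictions that are closer to the targets. The sketch below restates the two functions for a batch of shape (N, C) so that it runs on its own; the arrays `good` and `bad` are made-up predictions used only for illustration.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    return 0.5 * np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(y_pred, y_true):
    eps = 1e-7  # avoids log(0)
    return -np.sum(y_true * np.log(y_pred + eps)) / len(y_true)

# Two samples, three classes, one-hot targets.
y_true = np.array([[0, 0, 1], [0, 1, 0]], dtype=float)
good = np.array([[0.1, 0.1, 0.8], [0.2, 0.7, 0.1]])  # mass on the correct class
bad = np.array([[0.6, 0.3, 0.1], [0.5, 0.2, 0.3]])   # mass on the wrong class

print(cross_entropy_loss(good, y_true), cross_entropy_loss(bad, y_true))
print(mse_loss(good, y_true), mse_loss(bad, y_true))
```

Only the probability assigned to the correct class enters the cross-entropy, since the one-hot target zeroes out every other term.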
Numerical Gradient and Gradient Descent
def numerical_gradient(func, x, h=1e-4):
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        original = x[idx]
        x[idx] = original + h
        f_plus = func(x)
        x[idx] = original - h
        f_minus = func(x)
        grad[idx] = (f_plus - f_minus) / (2 * h)
        x[idx] = original
        it.iternext()
    return grad

def gradient_descent(func, x_init, lr=0.01, steps=100):
    x = x_init.copy()
    for _ in range(steps):
        grad = numerical_gradient(func, x)
        x -= lr * grad
    return x
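A small usage sketch follows, minimizing f(x) = x₀² + x₁², whose minimum is at the origin. To keep the block self-contained, the two functions are restated, using `np.ndindex` as a compact stand-in for the `np.nditer` loop; the test function `bowl`, the starting point, and the learning rate are illustrative choices.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-4):
    """Central-difference gradient; x must be a float array."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        original = x[idx]
        x[idx] = original + h
        f_plus = func(x)
        x[idx] = original - h
        f_minus = func(x)
        grad[idx] = (f_plus - f_minus) / (2 * h)
        x[idx] = original  # restore the perturbed entry
    return grad

def gradient_descent(func, x_init, lr=0.01, steps=100):
    x = x_init.copy()
    for _ in range(steps):
        x -= lr * numerical_gradient(func, x)
    return x

def bowl(x):
    return np.sum(x ** 2)

x0 = np.array([-3.0, 4.0])
print(numerical_gradient(bowl, x0.copy()))          # close to the analytic gradient [-6, 8]
print(gradient_descent(bowl, x0, lr=0.1, steps=100))  # close to [0, 0]
```

Numerical differentiation needs two function evaluations per parameter, so it is mainly useful for checking analytically derived gradients rather than for training.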
Backpropagation via Computational Graphs
Backpropagation computes gradients using the chain rule, propagating errors backward through operations.
Layer Abstractions
- Multiplication layer:
class MulLayer:
    def __init__(self):
        self.x = self.y = None

    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # Each input's gradient is the upstream gradient times the other input
        return dout * self.y, dout * self.x
- Addition layer:
class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # Addition passes the upstream gradient through unchanged
        return dout, dout
- ReLU layer:
class ReLULayer:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = x <= 0
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        # Copy so the caller's upstream gradient is not modified in place
        dx = dout.copy()
        dx[self.mask] = 0
        return dx
- Affine layer (fully connected):
class AffineLayer:
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.x = self.dW = self.db = None

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, dout):
        dx = dout @ self.W.T
        self.dW = self.x.T @ dout
        self.db = np.sum(dout, axis=0)
        return dx
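To see the layers cooperate, consider a small computational graph: total = (apple_price × apple_count + orange_price × orange_count) × tax. The sketch below restates the multiplication and addition layers so it runs on its own; the prices, counts, and tax rate are made-up values for illustration.

```python
class MulLayer:
    def __init__(self):
        self.x = self.y = None

    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        return dout * self.y, dout * self.x

class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        return dout, dout

apple_price, apple_count = 100.0, 2.0
orange_price, orange_count = 150.0, 3.0
tax = 1.1

mul_apple, mul_orange = MulLayer(), MulLayer()
add_fruit, mul_tax = AddLayer(), MulLayer()

# Forward pass, left to right through the graph.
apples = mul_apple.forward(apple_price, apple_count)
oranges = mul_orange.forward(orange_price, orange_count)
subtotal = add_fruit.forward(apples, oranges)
total = mul_tax.forward(subtotal, tax)

# Backward pass, in reverse order, starting from d(total)/d(total) = 1.
d_subtotal, d_tax = mul_tax.backward(1.0)
d_apples, d_oranges = add_fruit.backward(d_subtotal)
d_apple_price, d_apple_count = mul_apple.backward(d_apples)
d_orange_price, d_orange_count = mul_orange.backward(d_oranges)

print(total)                          # about 715
print(d_apple_price, d_apple_count)   # about 2.2 and 110
print(d_orange_price, d_orange_count, d_tax)  # about 3.3, 165 and 650
```

Each backward call only needs the values cached during its own forward call, which is what makes the chain rule local and composable.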
Convolutional Neural Networks (CNNs)
CNNs preserve spatial structure using convolution and pooling.
Core Concepts
- Convolution: Sliding filter over input to produce feature maps.
- Padding: Zero-padding controls output spatial dimensions.
- Stride: Step size between filter applications.
- Pooling: Downsampling (e.g., max-pooling) reduces resolution and adds translation invariance.
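Filter size, padding, and stride together determine the spatial size of a feature map via the standard formula out = (in + 2·pad − filter) / stride + 1. The helper below is a minimal sketch of that formula; the name `conv_output_size` and the 28×28 input size are assumptions made for illustration.

```python
def conv_output_size(input_size, filter_size, pad=0, stride=1):
    """Spatial output size along one axis: (input + 2*pad - filter) // stride + 1."""
    return (input_size + 2 * pad - filter_size) // stride + 1

# A 3x3 convolution with pad=1 and stride=1 keeps a 28x28 input at 28x28.
h = conv_output_size(28, filter_size=3, pad=1, stride=1)
print(h)  # 28

# A 2x2 pooling window with stride 2 halves it to 14x14.
h = conv_output_size(h, filter_size=2, pad=0, stride=2)
print(h)  # 14

# With 32 filters, the flattened feature vector has 32 * 14 * 14 entries.
print(32 * h * h)  # 6272
```

Pooling follows the same size formula as convolution, with the pooling window in place of the filter.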
Simple CNN Layer Stack
from collections import OrderedDict
import numpy as np

# ConvLayer, PoolLayer, and SoftmaxWithLoss are assumed to be defined elsewhere
# with the same forward/backward interface as the layers above.
class SimpleCNN:
    def __init__(self):
        # 32*14*14 assumes 28x28 inputs: conv1 keeps 28x28, pool1 halves it to 14x14.
        # pool1's output must be flattened to (N, 32*14*14) before it reaches affine1.
        self.layers = OrderedDict([
            ('conv1', ConvLayer(filter_num=32, filter_size=3, pad=1, stride=1)),
            ('relu1', ReLULayer()),
            ('pool1', PoolLayer(pool_h=2, pool_w=2, stride=2)),
            ('affine1', AffineLayer(W=np.random.randn(32*14*14, 100) * 0.01,
                                    b=np.zeros(100))),
            ('relu2', ReLULayer()),
            ('affine2', AffineLayer(W=np.random.randn(100, 10) * 0.01,
                                    b=np.zeros(10)))
        ])
        self.last_layer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.last_layer.forward(y, t)

    def gradient(self, x, t):
        self.loss(x, t)  # forward pass caches the values needed by backward
        dout = 1
        dout = self.last_layer.backward(dout)
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)
        # Use distinct keys so bias gradients do not overwrite weight gradients
        grads = {}
        for name, layer in self.layers.items():
            if hasattr(layer, 'dW'):
                grads[name + '_W'] = layer.dW
            if hasattr(layer, 'db'):
                grads[name + '_b'] = layer.db
        return grads
Optimization Strategies Beyond SGD
- Momentum: Accumulates velocity to dampen oscillation.
- AdaGrad: Adapts learning rates per parameter using historical gradient squares.
- Adam: Combines momentum and adaptive learning rates; default β₁=0.9, β₂=0.999.
class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = self.v = self.t = None

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(v) for k, v in params.items()}
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
            self.t = 0
        self.t += 1
        for k in params:
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * grads[k]
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * (grads[k] ** 2)
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            params[k] -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-7)
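Momentum and AdaGrad can be sketched with the same `update(params, grads)` interface as the Adam class. The class names, default hyperparameters, and the toy objective below are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr, self.momentum = lr, momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {k: np.zeros_like(p) for k, p in params.items()}
        for k in params:
            # Velocity is a decaying sum of past (negative) gradient steps.
            self.v[k] = self.momentum * self.v[k] - self.lr * grads[k]
            params[k] += self.v[k]

class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-7):
        self.lr, self.eps = lr, eps
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {k: np.zeros_like(p) for k, p in params.items()}
        for k in params:
            # Accumulate squared gradients; large accumulations shrink the step.
            self.h[k] += grads[k] ** 2
            params[k] -= self.lr * grads[k] / (np.sqrt(self.h[k]) + self.eps)

# Toy problem: minimize f(w) = sum(w**2), whose gradient is 2 * w.
results = {}
for opt in (Momentum(lr=0.1), AdaGrad(lr=1.0)):
    params = {'w': np.array([-3.0, 4.0])}
    for _ in range(200):
        opt.update(params, {'w': 2 * params['w']})
    results[type(opt).__name__] = params['w']
    print(type(opt).__name__, params['w'])  # both end up near [0, 0]
```

Because all three optimizers share one interface, a training loop can swap them without other changes.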
Weight Initialization and Regularization
- Xavier initialization: For sigmoid/tanh; weight variance scaled by 1/n_in.
- He initialization: For ReLU; weight variance scaled by 2/n_in.
- Weight decay (L2 regularization): Adds the penalty λ∑w² to the loss.
- Dropout: Randomly deactivates neurons during training to reduce co-adaptation.
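The four techniques above can be sketched in a few lines each. The function names are illustrative, and the dropout variant shown is "inverted" dropout (rescaling at training time so inference needs no change), which is one common choice and an assumption here rather than something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Standard deviation sqrt(1 / n_in), i.e. variance 1 / n_in.
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # Standard deviation sqrt(2 / n_in), i.e. variance 2 / n_in.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def l2_penalty(weights, lam):
    # lam * sum of squared weights over all weight matrices.
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout_forward(x, drop_ratio=0.5, train=True):
    # Inverted dropout (assumed variant): zero out units with probability
    # drop_ratio and rescale the survivors so the expected activation is unchanged.
    if not train:
        return x
    mask = rng.random(x.shape) > drop_ratio
    return x * mask / (1.0 - drop_ratio)

W_xavier = xavier_init(1000, 500)
W_he = he_init(1000, 500)
print(W_xavier.var(), W_he.var())   # close to 0.001 and 0.002
print(l2_penalty([W_xavier, W_he], lam=1e-4))
print(dropout_forward(np.ones((2, 4))))  # entries are either 0 or 2
```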
Batch Normalization
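Batch normalization standardizes each feature over the mini-batch to zero mean and unit variance, then applies a scale γ and shift β. At inference time it uses running averages of the mean and variance collected during training: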
class BatchNorm:
    def __init__(self, gamma=1.0, beta=0.0, eps=1e-5):
        self.gamma, self.beta, self.eps = gamma, beta, eps
        self.running_mean = self.running_var = None

    def forward(self, x, train=True):
        if train:
            mu = np.mean(x, axis=0)
            var = np.var(x, axis=0)
            if self.running_mean is None:
                self.running_mean = mu
                self.running_var = var
            else:
                self.running_mean = 0.9 * self.running_mean + 0.1 * mu
                self.running_var = 0.9 * self.running_var + 0.1 * var
            x_centered = x - mu
            inv_std = 1 / np.sqrt(var + self.eps)
            x_norm = x_centered * inv_std
        else:
            x_norm = (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
        out = self.gamma * x_norm + self.beta
        return out
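A quick check of the training-mode computation: after normalization, each feature of the batch should have mean close to 0 and standard deviation close to 1, regardless of the input scale. The function `batchnorm_forward` below is a stateless restatement written for this check, and the input distribution is arbitrary.

```python
import numpy as np

def batchnorm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Per-feature statistics over the batch axis (axis 0).
    mu = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # 64 samples, 4 features

out = batchnorm_forward(x)
print(out.mean(axis=0))  # each entry close to 0
print(out.std(axis=0))   # each entry close to 1

# gamma and beta then set the scale and offset of the normalized output.
scaled = batchnorm_forward(x, gamma=2.0, beta=0.5)
print(scaled.mean(axis=0), scaled.std(axis=0))  # close to 0.5 and 2
```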