Core Concepts and Architectures in NLP and Large Language Models
Natural Language Processing (NLP) enables computational systems to interpret and generate human language. Key tasks include text classification for spam filtering, sentiment analysis for social media monitoring, machine translation, automatic summarization, generative text creation, conversational agents for customer support, and speech-to-text conversion.
Fundamental Model Architectures
Sequential data processing in NLP relies heavily on specific neural network structures. The primary architectures include Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), Gated Recurrent Units (GRU), and the Transformer.
Recurrent Neural Networks (RNN)
RNNs process sequences by maintaining a hidden state that propagates information across time steps. At each step $t$, the input $x_t$ combines with the previous hidden state $h_{t-1}$ to produce the current state $h_t$. Unlike feedforward networks where information flows strictly forward, RNNs share parameters across time steps, allowing them to handle variable-length inputs. However, standard RNNs suffer from vanishing or exploding gradients when processing long sequences due to repeated matrix multiplications during backpropagation.
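As a minimal sketch of this recurrence (assuming PyTorch; the weight names W_xh and W_hh and all sizes are illustrative, not from any particular library), the same parameters are applied at every time step:

import torch

# Illustrative dimensions (assumptions, not from the text)
input_dim, hidden_dim, seq_len = 8, 16, 5

# Shared parameters reused at every time step
W_xh = torch.randn(input_dim, hidden_dim) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b_h = torch.zeros(hidden_dim)

x = torch.randn(seq_len, input_dim)  # one example sequence
h = torch.zeros(hidden_dim)          # initial hidden state

for t in range(seq_len):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)
    h = torch.tanh(x[t] @ W_xh + h @ W_hh + b_h)

print(h.shape)  # torch.Size([16])

The repeated multiplication by W_hh in this loop is exactly what makes gradients vanish or explode over long sequences.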
Long Short-Term Memory (LSTM)
LSTMs mitigate the long-term dependency issue through gating mechanisms and a dedicated cell state $C_t$. Three gates regulate information flow:
- Forget Gate: Decides what information to discard from the cell state.
- Input Gate: Determines which new values to update in the cell state.
- Output Gate: Controls what parts of the cell state become the output hidden state. While effective, LSTMs introduce significant parameter overhead compared to simple RNNs.
Gated Recurrent Unit (GRU)
GRUs simplify the LSTM architecture by merging the forget and input gates into a single update gate and eliminating the separate cell state. This reduction in parameters often leads to faster training while maintaining performance on long sequences. The output structure matches standard RNNs, providing a balance between efficiency and capability.
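A quick way to see the parameter trade-off between the three recurrent cells is to count weights for identical input and hidden sizes (a sketch assuming PyTorch's built-in nn.RNN, nn.GRU, and nn.LSTM; the dimensions are arbitrary):

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Same dimensions for all three recurrent layers (illustrative sizes)
kwargs = dict(input_size=128, hidden_size=256, batch_first=True)

print("RNN :", n_params(nn.RNN(**kwargs)))   # single tanh cell, 1 weight set
print("GRU :", n_params(nn.GRU(**kwargs)))   # 3 gate/candidate weight sets
print("LSTM:", n_params(nn.LSTM(**kwargs)))  # 4 gate/candidate weight sets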
Transformer Architecture
Introduced in 2017, the Transformer relies entirely on attention mechanisms, discarding recurrence. Key components include:
- Input Embedding: Converts token indices into dense vectors (e.g., 512 dimensions).
- Positional Encoding: Injects sequence order information using sine and cosine functions since the model lacks inherent recurrence.
- Multi-Head Attention: Projects inputs into Query (Q), Key (K), and Value (V) spaces across multiple heads to capture diverse relationships. The core operation is $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$.
- Feed-Forward Networks: Position-wise fully connected layers applied after attention.
- Add & Norm: Residual connections followed by layer normalization stabilize training.
Encoder stacks process the input sequence, while Decoder stacks generate output tokens autoregressively, masking future positions during training.
BERT (Bidirectional Encoder Representations from Transformers)
BERT utilizes the Transformer Encoder for bidirectional context understanding. Input representations sum Token, Segment, and Position embeddings. Pre-training involves:
- Masked Language Modeling (MLM): Randomly masking 15% of tokens and predicting them (a code sketch follows this list).
- Next Sentence Prediction (NSP): Predicting if two sentences are consecutive. Fine-tuning adapts these weights for downstream tasks like question answering by adding task-specific layers.
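A rough sketch of the MLM corruption step (assuming PyTorch; the 15% rate follows the text, while the token ids and the [MASK] id are made-up placeholders):

import torch

MASK_ID = 103          # placeholder id for the [MASK] token (assumption)
mask_prob = 0.15       # fraction of tokens selected for prediction

token_ids = torch.randint(1000, 2000, (1, 12))   # fake input ids
labels = token_ids.clone()

# Choose roughly 15% of positions to predict
selected = torch.rand(token_ids.shape) < mask_prob
labels[~selected] = -100                          # positions ignored by the loss

# Replace selected positions with [MASK]
# (BERT also keeps or randomizes a fraction of selected tokens; omitted here)
corrupted = token_ids.masked_fill(selected, MASK_ID)

print(corrupted)
print(labels)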
Inference and Training Dynamics
During Transformer inference for translation, the encoder processes the source sentence once. The decoder generates tokens sequentially: starting with a start token, it predicts the next word based on previous outputs and encoder context until an end token is produced. Training differs by feeding the entire target sequence to the decoder simultaneously, using a causal mask to prevent attending to future tokens, enabling parallel computation.
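The sequential decoding described above can be sketched as a greedy loop (assuming a generic `model(src_ids, tgt_ids)` callable that returns next-token logits; the function name and the special-token ids are illustrative placeholders):

import torch

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 32  # placeholder special tokens and length limit

def greedy_decode(model, src_ids):
    """Generate target tokens one at a time, conditioning on previous outputs."""
    generated = [BOS_ID]
    for _ in range(MAX_LEN):
        tgt = torch.tensor([generated])
        logits = model(src_ids, tgt)            # assumed shape (1, len(generated), vocab)
        next_id = int(logits[0, -1].argmax())   # most probable next token
        generated.append(next_id)
        if next_id == EOS_ID:
            break
    return generated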
Embedding Techniques
Embeddings map discrete tokens to continuous vectors, solving sparsity issues inherent in one-hot encoding and capturing semantic relationships. Methods include:
- Word2Vec: Uses CBOW (predicting target from context) or Skip-gram (predicting context from target) to learn vectors (see the skip-gram sketch after this list).
- Contextual Embeddings: Models like BERT generate dynamic embeddings based on surrounding text.
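A minimal skip-gram example, assuming the gensim library is available (the toy corpus and all hyperparameters are invented for illustration):

from gensim.models import Word2Vec

# Tiny toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects skip-gram (predict context from target); sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)                  # (50,)
print(model.wv.most_similar("cat", topn=2))   # nearest neighbors in the toy space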
Vertical Domain Adaptation
Adapting general models to specific industries involves:
- Fine-Tuning: Updating model weights on domain-specific instruction data. Techniques like LoRA reduce computational cost by training low-rank adaptation matrices.
- RAG (Retrieval-Augmented Generation): Augments prompts with retrieved external knowledge. This avoids retraining, ensures up-to-date information, and reduces hallucinations, though it adds latency and system complexity.
Retrieval-Augmented Generation (RAG)
RAG addresses LLM limitations such as hallucinations, knowledge cutoffs, and data privacy. The workflow involves indexing external documents, retrieving relevant chunks based on user queries, and injecting them into the prompt. Challenges include query understanding and handling diverse document formats. Retrieval strategies combine lexical methods (BM25, TF-IDF) with semantic search (vector similarity).
TF-IDF Algorithm
Term Frequency-Inverse Document Frequency weighs terms based on occurrence in a document versus rarity across the corpus. High TF-IDF scores indicate terms distinctive to a specific document.
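A small sketch of the computation in plain Python (the unsmoothed formula tf * log(N / df) is one common variant; the toy documents are invented):

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "stock prices fell sharply today".split(),
]

N = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)      # frequency within this document
    idf = math.log(N / df[term])         # rarity across the corpus
    return tf * idf

print(tf_idf("cat", docs[0]))    # appears in 2 of 3 docs: moderate weight
print(tf_idf("stock", docs[2]))  # appears in 1 of 3 docs: higher weight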
Multi-Head Attention Implementation
The following PyTorch module implements multi-head scaled dot-product attention, incorporating a causal mask and a final output projection.
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "Dimension must be divisible by heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.w_query = nn.Linear(d_model, d_model, bias=False)
        self.w_key = nn.Linear(d_model, d_model, bias=False)
        self.w_value = nn.Linear(d_model, d_model, bias=False)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, inputs):
        batch_size, seq_len, _ = inputs.shape
        # Linear projections
        q = self.w_query(inputs)
        k = self.w_key(inputs)
        v = self.w_value(inputs)
        # Reshape for multi-head processing: (batch, heads, seq, dim)
        q = q.view(batch_size, seq_len, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        k = k.view(batch_size, seq_len, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        v = v.view(batch_size, seq_len, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # Scaled dot-product: scores have shape (batch, heads, seq, seq)
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale
        # Causal mask: large negative values above the diagonal block attention to future positions
        mask = torch.triu(torch.ones_like(scores), diagonal=1) * -1e9
        scores = scores + mask
        weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(weights, v)
        # Concatenate heads and project back to d_model
        context = context.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, self.d_model)
        return self.output_proj(context)
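For reference, a quick shape check of the module above (batch size, sequence length, and model width are arbitrary choices):

attn = ScaledDotProductAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = attn(x)
print(out.shape)              # torch.Size([2, 10, 512])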
Key design choices include splitting dimensions for parallel head processing and scaling dot products to prevent softmax saturation. Variants like Multi-Query Attention (MQA) share K/V heads to reduce memory usage during inference.
Sequence Handling and Regularization
Padding: Sequences are padded to a uniform length within a batch using special tokens ignored during loss calculation. This enables batched matrix operations.
Dropout: Randomly zeroes elements during training with a probability $p$. This prevents co-adaptation of neurons and reduces overfitting. Input and output layers are typically excluded.
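A brief sketch of both mechanisms (assuming PyTorch; the pad id 0, the dropout probability, and all shapes are arbitrary):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Pad two variable-length sequences to a common length with id 0
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
batch = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (2, 3)

# Padding positions are excluded from the loss via ignore_index
loss_fn = nn.CrossEntropyLoss(ignore_index=0)

# Dropout zeroes activations at random during training only
dropout = nn.Dropout(p=0.1)
hidden = dropout(torch.randn(2, 3, 16))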
Prominent Model Families
| Model | Organization | Architecture | Notes |
|---|---|---|---|
| BERT | Google | Encoder | Bidirectional context, MLM pre-training |
| GPT Series | OpenAI | Decoder | Autoregressive, strong generative capabilities |
| Llama | Meta | Decoder | Open weights, optimized for efficiency |
| Gemini | Google DeepMind | Decoder (multimodal) | Native multimodal training |
| Qwen | Alibaba | Decoder | Strong multilingual support |
Semantic Retrieval Pipeline
- Ingestion: Parse diverse formats (PDF, DOCX) into raw text.
- Chunking: Split text into overlapping segments to preserve context while fitting model limits.
- Embedding: Convert chunks into vectors using models like BERT or specialized embedding models.
- Search: Compute similarity between query vectors and document vectors.
Example retrieval logic using vector similarity:
from sentence_transformers import SentenceTransformer
import numpy as np

def retrieve_relevant_docs(query, document_corpus, model_name='moka-ai/m3e-small'):
    # model_name is the full Hugging Face Hub id for the m3e-small embedding model
    encoder = SentenceTransformer(model_name)
    # Encode query and corpus into unit-length vectors
    query_vector = encoder.encode([query], normalize_embeddings=True)
    doc_vectors = encoder.encode(document_corpus, normalize_embeddings=True)
    # Compute cosine similarity via dot product (vectors are normalized)
    similarities = np.dot(query_vector, doc_vectors.T)[0]
    # Identify top result
    best_match_idx = np.argmax(similarities)
    return document_corpus[best_match_idx], similarities[best_match_idx]
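For example (a toy corpus; the embedding model must be downloadable from the Hugging Face Hub):

corpus = [
    "The Transformer relies entirely on attention mechanisms.",
    "LSTMs use gating to mitigate vanishing gradients.",
    "TF-IDF weighs terms by rarity across the corpus.",
]
doc, score = retrieve_relevant_docs("Which architecture uses attention?", corpus)
print(doc, score)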
Normalization Strategies
Layer Normalization: Normalizes across the feature dimension for each individual sample. Standard in Transformers as it is independent of batch size.
Batch Normalization: Normalizes across the batch dimension for each feature. Common in CNNs but less stable for NLP sequence tasks due to variable lengths.
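The difference is visible in which axis the statistics are computed over (a sketch assuming PyTorch; sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(4, 10, 64)   # (batch, seq_len, features)

# LayerNorm: mean/variance over the feature dimension, per token
layer_norm = nn.LayerNorm(64)
print(layer_norm(x).shape)    # torch.Size([4, 10, 64])

# BatchNorm1d expects (batch, features, seq_len): statistics per feature over the batch
batch_norm = nn.BatchNorm1d(64)
print(batch_norm(x.transpose(1, 2)).shape)  # torch.Size([4, 64, 10])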
Optimizing RAG Performance
To improve recall:
- Hybrid Search: Combine keyword-based (BM25) and vector-based retrieval.
- Re-ranking: Use cross-encoder models to score the top-k retrieved candidates more accurately.
- Chunk Optimization: Adjust segment size and overlap to balance context retention and precision.
Positional Encoding Rationale
Transformers lack recurrence, so position encodings are added to embeddings to preserve order. Sinusoidal functions allow the model to learn relative positions because $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
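A compact sketch of the sinusoidal encoding from the original Transformer paper (assuming PyTorch; max_len and d_model are arbitrary):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])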
Model Selection and Evaluation
Embedding Models: Select based on benchmarks like MTEB, considering language support and domain specificity.
Residual Connections: Add input to output before normalization ($x + F(x)$). This facilitates gradient flow in deep networks.
Loss Functions:
- Cross-Entropy: Standard for classification, penalizes confident wrong predictions heavily.
- MSE: Suitable for regression, sensitive to outliers.
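To see the "confident wrong prediction" penalty concretely (a PyTorch sketch with made-up logits and targets):

import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
target = torch.tensor([0])                       # true class is 0

confident_wrong = torch.tensor([[-4.0, 4.0]])    # strongly predicts class 1
mildly_wrong = torch.tensor([[-0.5, 0.5]])       # weakly predicts class 1

print(ce(confident_wrong, target).item())  # ~8.0: heavy penalty
print(ce(mildly_wrong, target).item())     # ~1.3: mild penalty

mse = nn.MSELoss()
print(mse(torch.tensor([2.5]), torch.tensor([3.0])).item())  # 0.25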
RAG Evaluation:
- Unsupervised: Use LLMs to judge faithfulness and relevance.
- Supervised: Compare against ground truth answers and source citations.
Parameter-Efficient Fine-Tuning (PEFT)
- LoRA: Decomposes weight updates into low-rank matrices, training only these small matrices while freezing the base model (a sketch follows this list).
- Prefix Tuning: Prepends learnable vectors to the input sequence, modifying attention without changing weights.
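A minimal sketch of the LoRA idea (assuming PyTorch; the class name, rank, scaling, and layer sizes are illustrative, and details such as per-module dropout are omitted):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(512, 512)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the low-rank matrices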
Compression and Acceleration
- Pruning: Removes insignificant weights.
- Quantization: Reduces precision (e.g., FP16 to INT8) to lower memory footprint (see the sketch after this list).
- Distillation: Trains a smaller student model to mimic a larger teacher.
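A simple symmetric INT8 quantization sketch (PyTorch; per-tensor scaling, which real toolchains refine with per-channel scales and calibration data):

import torch

weights = torch.randn(4, 4)                      # stand-in for FP32/FP16 weights

scale = weights.abs().max() / 127.0              # map the largest magnitude to 127
q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
dequantized = q.float() * scale                  # approximate reconstruction

print("max abs error:", (weights - dequantized).abs().max().item())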
Similarity Metrics
- Cosine Similarity: Measures angular difference, range [-1, 1].
- Euclidean Distance: Measures straight-line distance, sensitive to magnitude.
- Dot Product: Efficient for normalized vectors, combines magnitude and direction.
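Computed side by side on two toy vectors (NumPy sketch):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: identical direction
euclidean = np.linalg.norm(a - b)                                # > 0: magnitudes differ
dot = np.dot(a, b)                                               # grows with magnitude

print(cosine, euclidean, dot)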
Architectural Comparisons
BERT vs. GPT: BERT uses an Encoder-only stack for bidirectional understanding (ideal for classification). GPT uses a Decoder-only stack for autoregressive generation (ideal for text creation).
Perplexity (PPL): Measures how well a probability model predicts a sample; it equals the exponential of the average negative log-likelihood per token. Lower perplexity means the model assigns higher probability to the held-out text.
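A direct computation from cross-entropy (PyTorch sketch with random logits and targets):

import torch
import torch.nn.functional as F

vocab, seq_len = 100, 20
logits = torch.randn(seq_len, vocab)             # fake model predictions
targets = torch.randint(0, vocab, (seq_len,))    # fake reference tokens

nll = F.cross_entropy(logits, targets)           # mean negative log-likelihood per token
perplexity = torch.exp(nll)
print(perplexity.item())                         # roughly the vocab size for random predictions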
Advanced Topics
Long Context Handling: Techniques include sliding windows, sparse attention patterns, and memory mechanisms to extend beyond standard token limits.
Q, K, V Roles: Query represents the current token seeking information; Key represents tokens being searched; Value contains the actual content to be aggregated.
RLHF (Reinforcement Learning from Human Feedback): Aligns model outputs with human preferences using a reward model trained on human rankings, optimized via PPO.
Gradient Attacks: Adversarial inputs designed to manipulate model outputs or extract private data through gradient analysis.
KG + LLM vs. RAG: Knowledge Graphs provide structured reasoning and factual consistency but require maintenance. RAG offers flexibility with unstructured data but may lack explicit reasoning paths.