Core Concepts and Architectures in NLP and Large Language Models
Natural Language Processing (NLP) enables computational systems to interpret and generate human language. Key tasks include text classification for spam filtering, sentiment analysis for social media monitoring, machine translation, automatic summarization, generative text creation, conversational agents for customer support, and speech-to-text conversion.
Fundamental Model Architectures
Sequential data processing in NLP relies heavily on specific neural network structures. The primary architectures include Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), Gated Recurrent Units (GRU), and the Transformer.
Recurrent Neural Networks (RNN)
RNNs process sequences by maintaining a hidden state that propagates information across time steps. At each step $t$, the input $x_t$ combines with the previous hidden state $h_{t-1}$ to produce the current state $h_t$. Unlike feedforward networks where information flows strictly forward, RNNs share parameters across time steps, allowing them to handle variable-length inputs. However, standard RNNs suffer from vanishing or exploding gradients when processing long sequences due to repeated matrix multiplications during backpropagation.
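As a minimal sketch of this recurrence (assuming PyTorch; the weight names W_xh and W_hh and all sizes are illustrative, not from any particular library), the same parameters are applied at every time step:

import torch

# Illustrative dimensions (assumptions, not from the text)
input_dim, hidden_dim, seq_len = 8, 16, 5

# Shared parameters reused at every time step
W_xh = torch.randn(input_dim, hidden_dim) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b_h = torch.zeros(hidden_dim)

x = torch.randn(seq_len, input_dim)  # one example sequence
h = torch.zeros(hidden_dim)          # initial hidden state

for t in range(seq_len):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)
    h = torch.tanh(x[t] @ W_xh + h @ W_hh + b_h)

print(h.shape)  # torch.Size([16])

The repeated multiplication by W_hh in this loop is exactly what makes gradients vanish or explode over long sequences.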
Long Short-Term Memory (LSTM)
LSTMs mitigate the long-term dependency issue through gating mechanisms and a dedicated cell state $C_t$. Three gates regulate information flow:
- Forget Gate: Decides what information to discard from the cell state.
- Input Gate: Determines which new values to update in the cell state.
- Output Gate: Controls what parts of the cell state become the output hidden state. While effective, LSTMs introduce significant parameter overhead compared to simple RNNs.
Gated Recurrent Unit (GRU)
GRUs simplify the LSTM architecture by merging the forget and input gates into a single update gate and eliminating the separate cell state. This reduction in parameters often leads to faster training while maintaining performance on long sequences. The output structure matches standard RNNs, providing a balance between efficiency and capability.
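A quick way to see the parameter trade-off between the three recurrent cells is to count weights for identical input and hidden sizes (a sketch assuming PyTorch's built-in nn.RNN, nn.GRU, and nn.LSTM; the dimensions are arbitrary):

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Same dimensions for all three recurrent layers (illustrative sizes)
kwargs = dict(input_size=128, hidden_size=256, batch_first=True)

print("RNN :", n_params(nn.RNN(**kwargs)))   # single tanh cell, 1 weight set
print("GRU :", n_params(nn.GRU(**kwargs)))   # 3 gate/candidate weight sets
print("LSTM:", n_params(nn.LSTM(**kwargs)))  # 4 gate/candidate weight sets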
Transformer Architecture
Introduced in 2017, the Transformer relies entirely on attention mechanisms, discarding recurrence. Key components include:
- Input Embedding: Converts token indices into dense vectors (e.g., 512 dimensions).
- Positional Encoding: Injects sequence order information using sine and cosine functions since the model lacks inherent recurrence.
- Multi-Head Attention: Projects inputs into Query (Q), Key (K), and Value (V) spaces across multiple heads to capture diverse relationships. The core operation is $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$.
- Feed-Forward Networks: Position-wise fully connected layers applied after attention.
- Add & Norm: Residual connections followed by layer normalization stabilize training.
Encoder stacks process the input sequence, while Decoder stacks generate output tokens autoregressively, masking future positions during training.
BERT (Bidirectional Encoder Representations from Transformers)
BERT utilizes the Transformer Encoder for bidirectional context understanding. Input representations sum Token, Segment, and Position embeddings. Pre-training involves:
- Masked Language Modeling (MLM): Randomly masking 15% of tokens and predicting them (a code sketch follows this list).
- Next Sentence Prediction (NSP): Predicting if two sentences are consecutive. Fine-tuning adapts these weights for downstream tasks like question answering by adding task-specific layers.
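A rough sketch of the MLM corruption step (assuming PyTorch; the 15% rate follows the text, while the token ids and the [MASK] id are made-up placeholders):

import torch

MASK_ID = 103          # placeholder id for the [MASK] token (assumption)
mask_prob = 0.15       # fraction of tokens selected for prediction

token_ids = torch.randint(1000, 2000, (1, 12))   # fake input ids
labels = token_ids.clone()

# Choose roughly 15% of positions to predict
selected = torch.rand(token_ids.shape) < mask_prob
labels[~selected] = -100                          # positions ignored by the loss

# Replace selected positions with [MASK]
# (BERT also keeps or randomizes a fraction of selected tokens; omitted here)
corrupted = token_ids.masked_fill(selected, MASK_ID)

print(corrupted)
print(labels)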
Inference and Training Dynamics
During Transformer inference for translation, the encoder processes the source sentence once. The decoder generates tokens sequentially: starting with a start token, it predicts the next word based on previous outputs and encoder context until an end token is produced. Training differs by feeding the entire target sequence to the decoder simultaneously, using a causal mask to prevent attending to future tokens, enabling parallel computation.
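The sequential decoding described above can be sketched as a greedy loop (assuming a generic `model(src_ids, tgt_ids)` callable that returns next-token logits; the function name and the special-token ids are illustrative placeholders):

import torch

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 32  # placeholder special tokens and length limit

def greedy_decode(model, src_ids):
    """Generate target tokens one at a time, conditioning on previous outputs."""
    generated = [BOS_ID]
    for _ in range(MAX_LEN):
        tgt = torch.tensor([generated])
        logits = model(src_ids, tgt)            # assumed shape (1, len(generated), vocab)
        next_id = int(logits[0, -1].argmax())   # most probable next token
        generated.append(next_id)
        if next_id == EOS_ID:
            break
    return generated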
Embedding Techniques
Embeddings map discrete tokens to continuous vectors, solving sparsity issues inherent in one-hot encoding and capturing semantic relationships. Methods include:
- Word2Vec: Uses CBOW (predicting target from context) or Skip-gram (predicting context from target) to learn vectors (see the skip-gram sketch after this list).
- Contextual Embeddings: Models like BERT generate dynamic embeddings based on surrounding text.
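A minimal skip-gram example, assuming the gensim library is available (the toy corpus and all hyperparameters are invented for illustration):

from gensim.models import Word2Vec

# Tiny toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects skip-gram (predict context from target); sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)                  # (50,)
print(model.wv.most_similar("cat", topn=2))   # nearest neighbors in the toy space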
Vertical Domain Adaptation
Adapting general models to specific industries involves:
- Fine-Tuning: Updating model weights on domain-specific instruction data. Techniques like LoRA reduce computational cost by training low-rank adaptation matrices.
- RAG (Retrieval-Augmented Generation): Augments prompts with retrieved external knowledge. This avoids retraining, ensures up-to-date information, and reduces hallucinations, though it adds latency and system complexity.
Retrieval-Augmented Generation (RAG)
RAG addresses LLM limitations such as hallucinations, knowledge cutoffs, and data privacy. The workflow involves indexing external documents, retrieving relevant chunks based on user queries, and injecting them into the prompt. Challenges include query understanding and handling diverse document formats. Retrieval strategies combine lexical methods (BM25, TF-IDF) with semantic search (vector similarity).
TF-IDF Algorithm
Term Frequency-Inverse Document Frequency weighs terms based on occurrence in a document versus rarity across the corpus. High TF-IDF scores indicate terms distinctive to a specific document.
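A small sketch of the computation in plain Python (the unsmoothed formula tf * log(N / df) is one common variant; the toy documents are invented):

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "stock prices fell sharply today".split(),
]

N = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)      # frequency within this document
    idf = math.log(N / df[term])         # rarity across the corpus
    return tf * idf

print(tf_idf("cat", docs[0]))    # appears in 2 of 3 docs: moderate weight
print(tf_idf("stock", docs[2]))  # appears in 1 of 3 docs: higher weight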
Multi-Head Attention Implementation
The following PyTorch module implements multi-head scaled dot-product attention, incorporating a causal mask and a final output projection.
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "Dimension must be divisible by heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.w_query = nn.Linear(d_model, d_model, bias=False)
        self.w_key = nn.Linear(d_model, d_model, bias=False)
        self.w_value = nn.Linear(d_model, d_model, bias=False)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, inputs):
        batch_size, seq_len, _ = inputs.shape
        # Linear projections
        q = self.w_query(inputs)
        k = self.w_key(inputs)
        v = self.w_value(inputs)
        # Reshape for multi-head processing: (batch, heads, seq, dim)
        q = q.view(batch_size, seq_len, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        k = k.view(batch_size, seq_len, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        v = v.view(batch_size, seq_len, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # Scaled dot-product: scores have shape (batch, heads, seq, seq)
        scale = 1.0 / math.sqrt(self.head_dim)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale
        # Causal mask: large negative values above the diagonal block attention to future positions
        mask = torch.triu(torch.ones_like(scores), diagonal=1) * -1e9
        scores = scores + mask
        weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(weights, v)
        # Concatenate heads and project back to d_model
        context = context.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, self.d_model)
        return self.output_proj(context)
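For reference, a quick shape check of the module above (batch size, sequence length, and model width are arbitrary choices):

attn = ScaledDotProductAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = attn(x)
print(out.shape)              # torch.Size([2, 10, 512])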
Key design choices include splitting dimensions for parallel head processing and scaling dot products to prevent softmax saturation. Variants like Multi-Query Attention (MQA) share K/V heads to reduce memory usage during inference.
Sequence Handling and Regularization
Padding: Sequences are padded to a uniform length within a batch using special tokens ignored during loss calculation. This enables batched matrix operations.
Dropout: Randomly zeroes elements during training with a probability $p$. This prevents co-adaptation of neurons and reduces overfitting. Input and output layers are typically excluded.
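A brief sketch of both mechanisms (assuming PyTorch; the pad id 0, the dropout probability, and all shapes are arbitrary):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Pad two variable-length sequences to a common length with id 0
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
batch = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (2, 3)

# Padding positions are excluded from the loss via ignore_index
loss_fn = nn.CrossEntropyLoss(ignore_index=0)

# Dropout zeroes activations at random during training only
dropout = nn.Dropout(p=0.1)
hidden = dropout(torch.randn(2, 3, 16))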
Prominent Model Families
| Model | Organization | Architecture | Notes |
|---|---|---|---|
| BERT | Google | Encoder | Bidirectional context, MLM pre-training |
| GPT Series | OpenAI | Decoder | Autoregressive, strong generative capabilities |
| Llama | Meta | Decoder | Open weights, optimized for efficiency |
| Gemini | Google DeepMind | Decoder (multimodal) | Native multimodal training |
| Qwen | Alibaba | Decoder | Strong multilingual support |
Semantic Retrieval Pipeline
- Ingestion: Parse diverse formats (PDF, DOCX) into raw text.
- Chunking: Split text into overlapping segments to preserve context while fitting model limits.
- Embedding: Convert chunks into vectors using models like BERT or specialized embedding models.
- Search: Compute similarity between query vectors and document vectors.
Example retrieval logic using vector similarity:
from sentence_transformers import SentenceTransformer
import numpy as np

def retrieve_relevant_docs(query, document_corpus, model_name='moka-ai/m3e-small'):
    # model_name is the full Hugging Face Hub id for the m3e-small embedding model
    encoder = SentenceTransformer(model_name)
    # Encode query and corpus into unit-length vectors
    query_vector = encoder.encode([query], normalize_embeddings=True)
    doc_vectors = encoder.encode(document_corpus, normalize_embeddings=True)
    # Compute cosine similarity via dot product (vectors are normalized)
    similarities = np.dot(query_vector, doc_vectors.T)[0]
    # Identify top result
    best_match_idx = np.argmax(similarities)
    return document_corpus[best_match_idx], similarities[best_match_idx]
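For example (a toy corpus; the embedding model must be downloadable from the Hugging Face Hub):

corpus = [
    "The Transformer relies entirely on attention mechanisms.",
    "LSTMs use gating to mitigate vanishing gradients.",
    "TF-IDF weighs terms by rarity across the corpus.",
]
doc, score = retrieve_relevant_docs("Which architecture uses attention?", corpus)
print(doc, score)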
Normalization Strategies
Layer Normalization: Normalizes across the feature dimension for each individual sample. Standard in Transformers as it is independent of batch size.
Batch Normalization: Normalizes across the batch dimension for each feature. Common in CNNs but less stable for NLP sequence tasks due to variable lengths.
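The difference is visible in which axis the statistics are computed over (a sketch assuming PyTorch; sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(4, 10, 64)   # (batch, seq_len, features)

# LayerNorm: mean/variance over the feature dimension, per token
layer_norm = nn.LayerNorm(64)
print(layer_norm(x).shape)    # torch.Size([4, 10, 64])

# BatchNorm1d expects (batch, features, seq_len): statistics per feature over the batch
batch_norm = nn.BatchNorm1d(64)
print(batch_norm(x.transpose(1, 2)).shape)  # torch.Size([4, 64, 10])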
Optimizing RAG Performance
To improve recall:
- Hybrid Search: Combine keyword-based (BM25) and vector-based retrieval.
- Re-ranking: Use cross-encoder models to score the top-k retrieved candidates more accurately.
- Chunk Optimization: Adjust segment size and overlap to balance context retention and precision.
Positional Encoding Rationale
Transformers lack recurrence, so position encodings are added to embeddings to preserve order. Sinusoidal functions allow the model to learn relative positions because $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
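A compact sketch of the sinusoidal encoding from the original Transformer paper (assuming PyTorch; max_len and d_model are arbitrary):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])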
Model Selection and Evaluation
Embedding Models: Select based on benchmarks like MTEB, considering language support and domain specificity.
Residual Connections: Add input to output before normalization ($x + F(x)$). This facilitates gradient flow in deep networks.
Loss Functions:
- Cross-Entropy: Standard for classification, penalizes confident wrong predictions heavily.
- MSE: Suitable for regression, sensitive to outliers.
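To see the "confident wrong prediction" penalty concretely (a PyTorch sketch with made-up logits and targets):

import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
target = torch.tensor([0])                       # true class is 0

confident_wrong = torch.tensor([[-4.0, 4.0]])    # strongly predicts class 1
mildly_wrong = torch.tensor([[-0.5, 0.5]])       # weakly predicts class 1

print(ce(confident_wrong, target).item())  # ~8.0: heavy penalty
print(ce(mildly_wrong, target).item())     # ~1.3: mild penalty

mse = nn.MSELoss()
print(mse(torch.tensor([2.5]), torch.tensor([3.0])).item())  # 0.25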
RAG Evaluation:
- Unsupervised: Use LLMs to judge faithfulness and relevance.
- Supervised: Compare against ground truth answers and source citations.
Parameter-Efficient Fine-Tuning (PEFT)
- LoRA: Decomposes weight updates into low-rank matrices, training only these small matrices while freezing the base model (a sketch follows this list).
- Prefix Tuning: Prepends learnable vectors to the input sequence, modifying attention without changing weights.
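A minimal sketch of the LoRA idea (assuming PyTorch; the class name, rank, scaling, and layer sizes are illustrative, and details such as per-module dropout are omitted):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(512, 512)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the low-rank matrices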
Compression and Acceleration
- Pruning: Removes insignificant weights.
- Quantization: Reduces precision (e.g., FP16 to INT8) to lower memory footprint (see the sketch after this list).
- Distillation: Trains a smaller student model to mimic a larger teacher.
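A simple symmetric INT8 quantization sketch (PyTorch; per-tensor scaling, which real toolchains refine with per-channel scales and calibration data):

import torch

weights = torch.randn(4, 4)                      # stand-in for FP32/FP16 weights

scale = weights.abs().max() / 127.0              # map the largest magnitude to 127
q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
dequantized = q.float() * scale                  # approximate reconstruction

print("max abs error:", (weights - dequantized).abs().max().item())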
Similarity Metrics
- Cosine Similarity: Measures angular difference, range [-1, 1].
- Euclidean Distance: Measures straight-line distance, sensitive to magnitude.
- Dot Product: Efficient for normalized vectors, combines magnitude and direction.
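Computed side by side on two toy vectors (NumPy sketch):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: identical direction
euclidean = np.linalg.norm(a - b)                                # > 0: magnitudes differ
dot = np.dot(a, b)                                               # grows with magnitude

print(cosine, euclidean, dot)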
Architectural Comparisons
BERT vs. GPT: BERT uses an Encoder-only stack for bidirectional understanding (ideal for classification). GPT uses a Decoder-only stack for autoregressive generation (ideal for text creation).
Perplexity (PPL): Measures how well a probability model predicts a sample; it equals the exponential of the average negative log-likelihood per token. Lower perplexity means the model assigns higher probability to the held-out text.
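A direct computation from cross-entropy (PyTorch sketch with random logits and targets):

import torch
import torch.nn.functional as F

vocab, seq_len = 100, 20
logits = torch.randn(seq_len, vocab)             # fake model predictions
targets = torch.randint(0, vocab, (seq_len,))    # fake reference tokens

nll = F.cross_entropy(logits, targets)           # mean negative log-likelihood per token
perplexity = torch.exp(nll)
print(perplexity.item())                         # roughly the vocab size for random predictions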
Advanced Topics
Long Context Handling: Techniques include sliding windows, sparse attention patterns, and memory mechanisms to extend beyond standard token limits.
Q, K, V Roles: Query represents the current token seeking information; Key represents tokens being searched; Value contains the actual content to be aggregated.
RLHF (Reinforcement Learning from Human Feedback): Aligns model outputs with human preferences using a reward model trained on human rankings, optimized via PPO.
Gradient Attacks: Adversarial inputs designed to manipulate model outputs or extract private data through gradient analysis.
KG + LLM vs. RAG: Knowledge Graphs provide structured reasoning and factual consistency but require maintenance. RAG offers flexibility with unstructured data but may lack explicit reasoning paths.