Understanding the Transformer Architecture and Key Components

Tech May 18 14

Overall Architecture of Transformers

Taking machine translation as an example, when we input a text sequence, the model outputs a corresponding translated sequence. The Transformer operates as a black box in this process.

Opening up the black box reveals an encoder-decoder structure. The model first encodes the input text into a representation optimized for machine processing, then decodes this representation into the desired output format.

The architecture contains multiple stacked encoder and decoder layers. The encoding process is performed progressively through multiple encoder layers, similar to how humans process text through multiple cognitive steps. The decoding process follows the same layered approach.

Looking at the internal structure of encoder and decoder layers: an encoder layer first passes input through a self-attention layer, followed by a feed-forward layer. A decoder has a similar structure, with an additional encoder-decoder attention layer inserted between the self-attention and feed-forward layers.

Attention Mechanism Calculation

The core of the Transformer is its attention mechanism. To compute attention, we first generate Q (Query), K (Key), and V (Value) vectors.

Q, K, and V are the three core quantities of the attention mechanism. They are central to the Transformer and foundational to later NLP models such as BERT and GPT.

Q (Query): Represents the information demand from the current token being processed. It matches with all other tokens' keys to determine their importance to the current token.
K (Key): Carries information for each token. It matches with queries to generate attention weights indicating how important the corresponding value is for the current query.
V (Value): Contains the actual information content of each token. After determining query-key matching degrees, these values are weighted and summed based on attention weights to produce the final attention output.

The workflow is: Query sends information demand → Query compares with all Keys to compute relevance (usually via dot product) → Assign weights based on relevance → Apply weights to corresponding Values and sum to get the final result.

Generating Q, K, V Vectors

Take input tokens "a" and "b" as an example. First, we apply embedding to convert them into token vectors v1 and v2.

Embedding is the process of converting discrete data (such as words or category labels) into continuous vectors, usually in a low-dimensional space. It's a key step in deep learning models.

Purpose: Dimensionality reduction (converting high-dimensional sparse data such as one-hot vectors into low-dimensional dense vectors), representation learning (similar items end up close together in embedding space), and giving the model a numeric input it can operate on.
Process: Initialize a random weight matrix → Update its rows via backpropagation during training → At inference time, look up the row corresponding to a given discrete item.
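
The lookup step can be sketched in a few lines of NumPy. This is a minimal illustration rather than a full embedding layer: the vocabulary, the token ids, and the dimension 4 are all made up for the example, and the training step (updating the table via backpropagation) is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; ids and dimension are illustrative only.
vocab = {"a": 0, "b": 1, "c": 2}
d_model = 4

# Initialize a random weight matrix with one row per vocabulary item.
# In a real model these rows would be updated during training.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up the embedding row for each token."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

vectors = embed(["a", "b"])  # rows play the role of v1 and v2
print(vectors.shape)  # (2, 4)
```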

We multiply these token vectors with learnable weight matrices W_Q, W_K, W_V to get the corresponding Q, K, V vectors for each token:

W_Q (Query Weight Matrix): Converts token vectors to query vectors for determining important information in context.
W_K (Key Weight Matrix): Converts token vectors to key vectors for matching with queries to compute attention weights.
W_V (Value Weight Matrix): Converts token vectors to value vectors containing actual information content for weighted summation.

Operations: for token 1, Q1 = v1 × W_Q, K1 = v1 × W_K, V1 = v1 × W_V; for token 2, Q2 = v2 × W_Q, K2 = v2 × W_K, V2 = v2 × W_V (where × denotes matrix multiplication).
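
As a rough sketch of these projections, the token vectors can be stacked as rows of a matrix so that one matrix product yields the Q, K, or V vectors for every token at once. The sizes below and the random weights are assumptions for illustration; in a trained model W_Q, W_K, and W_V are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3  # illustrative sizes, not taken from the article

# Stand-ins for the token vectors v1 and v2, one row per token.
v = rng.normal(size=(2, d_model))

# Projection matrices; random here because nothing is trained.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Row i of each result is the Q, K, or V vector of token i.
Q = v @ W_Q
K = v @ W_K
V = v @ W_V

Q1, K2, V2 = Q[0], K[1], V[1]
print(Q.shape, K.shape, V.shape)  # (2, 3) (2, 3) (2, 3)
```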

Self-Attention Calculation Process

Take tokens "Thinking" and "Machines" as examples. After embedding, we get their token vectors, then multiply by the three weight matrices to get Q, K, V for each.

To compute self-attention for "Thinking", first calculate the Score (relevance) between its query vector and all key vectors in the sequence:

Score_Thinking,i = Q_Thinking · K_i (dot product of Q_Thinking and K_i)

The Score measures the correlation between the query token and other tokens. Higher scores mean stronger relevance. These scores are converted to a probability distribution via softmax to get attention weights.

Steps for "Thinking":

Compute the Score against every token in the sequence, including itself: Score_Thinking,Thinking = Q_Thinking · K_Thinking and Score_Thinking,Machines = Q_Thinking · K_Machines
Divide each Score by √d, where d is the dimension of the key vectors (for example, d = 64 gives √d = 8)
Apply softmax across the scaled Scores to get normalized attention weights that sum to 1
Multiply each weight by the corresponding Value vector, then sum the results to get the self-attention output for "Thinking"

The same process applies to "Machines". For multiple tokens, matrix operations are used for efficiency.
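
The matrix form of these steps can be sketched as follows. This is a minimal NumPy version under assumed sizes (two tokens, key dimension 4) with random Q, K, V standing in for the projected vectors of "Thinking" and "Machines"; the function names are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q, K, V: arrays of shape (seq_len, d_k).
    Returns the attention output and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Score for every query/key pair
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of Value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # row 0: "Thinking", row 1: "Machines"
K = rng.normal(size=(2, 4))
V = rng.normal(size=(2, 4))

output, weights = self_attention(Q, K, V)
print(output.shape, weights.shape)  # (2, 4) (2, 2)
```

Row 0 of `weights` holds the attention that "Thinking" pays to each token, and row 0 of `output` is its self-attention result.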

Single-Head vs Multi-Head Attention

Single-head attention uses one set of Q, K, V. Multi-head attention uses multiple sets of Q, K, V generated by multiple groups of learnable weight matrices.

Each head computes attention independently, capturing different features or relationships in the input. The outputs of all heads are concatenated, then multiplied by a linear projection matrix W_O to produce the final multi-head attention output.

Concatenation merges the information from the different attention heads. Because each head learns in its own subspace, this increases the model's expressive power, while the W_O projection keeps the output dimension consistent for subsequent layers.
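
A sketch of the multi-head computation, assuming two heads, a model dimension of 8, and a head dimension equal to the model dimension divided by the number of heads (a common choice, not something the text above specifies). All weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O):
    """x: (seq_len, d_model).
    W_Q, W_K, W_V: (num_heads, d_model, d_head), one weight group per head.
    W_O: (num_heads * d_head, d_model).
    """
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(weights @ V)            # each head attends independently
    concat = np.concatenate(heads, axis=-1)  # (seq_len, num_heads * d_head)
    return concat @ W_O                      # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 8, 2
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(num_heads, d_model, d_head))
W_K = rng.normal(size=(num_heads, d_model, d_head))
W_V = rng.normal(size=(num_heads, d_model, d_head))
W_O = rng.normal(size=(num_heads * d_head, d_model))

out = multi_head_attention(x, W_Q, W_K, W_V, W_O)
print(out.shape)  # (3, 8)
```

The loop over heads keeps the example readable; practical implementations usually batch all heads into a single tensor operation.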

Token Vector Encoding

Transformer vs RNN: Position Information

Consider two examples: (1) Sentences with identical words but opposite meanings due to word order; (2) Sentences with changed word order but same meaning. This shows that word position is critical for semantic meaning.

Unlike RNNs, which process tokens one at a time and therefore capture position information implicitly, Transformers process all tokens in parallel, and self-attention by itself does not take token order into account. Thus, we must inject position information into the token vectors to differentiate the same token at different positions.

Final Encoding: Token Embedding + Position Encoding

The final token encoding is the sum of the token embedding and position encoding (PE).

Position encoding uses sine and cosine functions to inject position information. For a position pos and dimension index i (with total dimension d):

P_pos,2i = sin(pos / 10000^(2i/d))

P_pos,2i+1 = cos(pos / 10000^(2i/d))

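Even dimensions use sine and odd dimensions use cosine. As the dimension index i increases, the wavelength of the sinusoid grows geometrically, from 2π at i = 0 toward 10000 · 2π, so low dimensions vary quickly with position and high dimensions vary slowly. This lets the model capture both local and long-distance position relationships.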

Example: For d=512, pos=1:

i=0: P_1,0 = sin(1), P_1,1 = cos(1)
i=1: P_1,2 = sin(1 / 10000^(2/512)), P_1,3 = cos(1 / 10000^(2/512))
Repeat until i=255 to get the full 512-dimensional position encoding vector.

The position encoding vector is added to the corresponding token embedding to form the final encoded vector.
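
The two formulas translate directly into NumPy. The sketch below assumes an even dimension d; the function name and the choice of 10 positions are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal position encoding of shape (max_len, d); d must be even."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d // 2)[None, :]               # (1, d / 2)
    angles = pos / np.power(10000.0, 2 * i / d)  # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i + 1
    return pe

pe = positional_encoding(max_len=10, d=512)

# Check against the worked example for pos = 1.
assert np.isclose(pe[1, 0], np.sin(1.0))
assert np.isclose(pe[1, 1], np.cos(1.0))
assert np.isclose(pe[1, 2], np.sin(1 / 10000 ** (2 / 512)))

# Final encoding: token embedding + position encoding (random stand-in embeddings).
token_embeddings = np.random.default_rng(0).normal(size=(10, 512))
encoded = token_embeddings + pe
```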

Masking in Transformers

PAD Mask

PAD Mask handles inconsistent sequence lengths within a batch. Shorter sequences are padded with a special token to match the longest sequence in the batch. The PAD Mask is a binary vector in which 1 (or True) marks padding positions and 0 (or False) marks actual data positions.

Example: Three sequences padded to length 4:

Sequence 1: ["我", "是", "", ""] → Mask: [0, 0, 1, 1]
Sequence 2: ["你", "是", "谁", ""] → Mask: [0, 0, 0, 1]
Sequence 3: ["他", "们", "是", "学生"] → Mask: [0, 0, 0, 0]

PAD Mask ensures padding positions are ignored in attention calculations and model training.
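
A small sketch of how such a mask might be built from token ids. The id 0 for the padding token and the other ids are hypothetical; any id that never occurs as a real token would work.

```python
import numpy as np

PAD_ID = 0  # hypothetical id reserved for the padding token

# Three sequences of hypothetical token ids with lengths 2, 3, and 4.
batch = [[5, 7], [3, 7, 9], [4, 6, 7, 8]]

# Pad every sequence to the length of the longest one.
max_len = max(len(seq) for seq in batch)
padded = np.array([seq + [PAD_ID] * (max_len - len(seq)) for seq in batch])

# 1 marks padding positions, 0 marks actual data positions.
pad_mask = (padded == PAD_ID).astype(int)
print(pad_mask)
# [[0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```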

Upper Triangular Mask

Upper Triangular Mask maintains causal order in sequence generation tasks (e.g., language modeling). It masks future tokens so the model can only attend to current and past tokens when predicting the next token.

It is a square matrix of size equal to the sequence length, where all elements above the main diagonal are set to negative infinity (or a very large negative number), and elements on or below the diagonal are 0. When applied to attention scores, softmax assigns zero weight to masked positions.

Example for sequence length 4:


0  -inf  -inf  -inf
0   0    -inf  -inf
0   0     0    -inf
0   0     0     0

This ensures that when computing attention for position 1, only position 1 is visible; for position 2, positions 1 and 2 are visible; and so on.
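
The mask and its effect on softmax can be sketched like this. The all-zero score matrix is only a stand-in so that the resulting weights are easy to read; the function names are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """0 on and below the main diagonal, -inf strictly above it."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

mask = causal_mask(4)

# Stand-in attention scores (all equal), then mask and normalize.
scores = np.zeros((4, 4))
weights = softmax(scores + mask)
print(weights[1])  # [0.5 0.5 0.  0. ]
```

Each row spreads its weight only over the current and earlier positions; the masked entries receive exactly zero because exp(-inf) is 0.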

Combining Masks

In practice, the PAD Mask and the Upper Triangular Mask are combined by union: a position is masked if either mask marks it. This handles padding and causal order constraints in a single step.
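
One way to sketch the union is with boolean masks and broadcasting. Here True means "masked", the PAD mask values repeat the three-sequence example, and the PAD mask is assumed to apply to key positions (the columns of the attention matrix).

```python
import numpy as np

# PAD mask for three sequences padded to length 4 (True = padding).
pad_mask = np.array([[0, 0, 1, 1],
                     [0, 0, 0, 1],
                     [0, 0, 0, 0]], dtype=bool)
seq_len = pad_mask.shape[1]

# Causal mask: True strictly above the main diagonal (future positions).
causal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Union via broadcasting: (1, L, L) | (batch, 1, L) -> (batch, L, L).
combined = causal[None, :, :] | pad_mask[:, None, :]

# Additive form to add to attention scores before softmax.
additive = np.where(combined, -np.inf, 0.0)
print(combined[0].astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 1 1]
#  [0 0 1 1]]
```

For the first sequence, the last two columns stay masked in every row because those key positions are padding.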

Full Computational Pipeline

Encoder Workflow

Convert input tokens to embeddings, add position encoding to get position-aware vectors x1, x2, ...
Pass through multi-head self-attention layer to get attention outputs z1, z2, ...
Apply residual connection: add z to the original x, then apply layer normalization
Pass the normalized result through a feed-forward network (fully connected layers)
Apply another residual connection and layer normalization to get the encoder output
Repeat this process through multiple stacked encoder layers

Key terms:

Residual Connection: Adds a sub-layer's input directly to its output, which alleviates vanishing gradients and improves training efficiency
Layer Normalization: Normalizes activations within a layer to stabilize and accelerate training
Feed-Forward Network: Two fully connected layers with a ReLU activation in between, applied independently to each token
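
The residual, normalization, and feed-forward steps of one encoder layer can be sketched as below. This is a simplified version: layer normalization omits its learnable gain and bias, the self-attention output z is a random stand-in, and all sizes and weights are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two fully connected layers with ReLU in between, applied per token."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_sublayers(x, z, W1, b1, W2, b2):
    """x: position-aware inputs, z: self-attention outputs, both (seq_len, d_model)."""
    h = layer_norm(x + z)  # residual connection + layer normalization
    return layer_norm(h + feed_forward(h, W1, b1, W2, b2))  # same pattern around the FFN

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 3, 8, 16
x = rng.normal(size=(seq_len, d_model))
z = rng.normal(size=(seq_len, d_model))  # stand-in for the attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = encoder_sublayers(x, z, W1, b1, W2, b2)
print(out.shape)  # (3, 8)
```

Because the output has the same shape as the input, the layer can be stacked repeatedly.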

Decoder Workflow

Process target tokens similarly to encoder inputs (embedding + position encoding)
Pass through masked multi-head self-attention layer (with upper triangular mask to prevent attending to future tokens)
Apply residual connection and layer normalization
Pass through encoder-decoder attention layer: K and V come from the final encoder output, Q comes from the output of the preceding decoder sub-layer
Apply residual connection and layer normalization
Pass through feed-forward network, then residual connection and layer normalization
Repeat through multiple stacked decoder layers
Pass final output through a linear layer and softmax to generate probability distribution over the target vocabulary
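
Two decoder-specific steps, encoder-decoder attention and the final linear layer with softmax, can be sketched as follows. It is a single-head simplification with random weights; the sequence lengths, model dimension, vocabulary size, and function names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, encoder_output, W_Q, W_K, W_V):
    """Q comes from the decoder; K and V come from the encoder output."""
    Q = decoder_state @ W_Q
    K = encoder_output @ W_K
    V = encoder_output @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (tgt_len, src_len)
    return weights @ V

def output_distribution(decoder_output, W_vocab, b_vocab):
    """Linear layer followed by softmax over the target vocabulary."""
    return softmax(decoder_output @ W_vocab + b_vocab)

rng = np.random.default_rng(0)
src_len, tgt_len, d_model, vocab_size = 5, 3, 8, 11

encoder_output = rng.normal(size=(src_len, d_model))
decoder_state = rng.normal(size=(tgt_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attended = cross_attention(decoder_state, encoder_output, W_Q, W_K, W_V)

W_vocab = rng.normal(size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)
probs = output_distribution(attended, W_vocab, b_vocab)
print(attended.shape, probs.shape)  # (3, 8) (3, 11)
```

Each row of `probs` is a probability distribution over the target vocabulary for one target position. The residual connections, layer normalization, and feed-forward step between these two stages are skipped here for brevity.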

Data Generation Strategy for Translation Tasks

For translation experiments, we generate source (X) and target (Y) sequences:

Source Language (X): Has a vocabulary of 7 words. Randomly sample sequences with different probabilities for each word, and random sequence lengths.
Target Language (Y): Generated from X by reversing the sequence, capitalizing letters, subtracting digits from 10, and making the first Y token depend on the last X token.
Add start/end tokens, then pad sequences to a fixed length with padding tokens.
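
A rough sketch of such a generator follows. Nearly every specific here is an assumption made for illustration: the seven source symbols, the sampling weights, the length range, the fixed length of 10, and the token names `<SOS>`, `<EOS>`, and `<PAD>`. The rule that makes the first Y token depend on the last X token is not specified above, so this sketch leaves it out.

```python
import random

# Hypothetical 7-symbol source vocabulary with made-up sampling weights.
SOURCE_VOCAB = ["a", "b", "c", "d", "1", "2", "3"]
WEIGHTS = [1, 2, 3, 4, 5, 6, 7]
SOS, EOS, PAD = "<SOS>", "<EOS>", "<PAD>"  # assumed special token names

def transform(token):
    """Capitalize letters; replace a digit n with 10 - n."""
    if token.isdigit():
        return str(10 - int(token))
    return token.upper()

def wrap(seq, fixed_len):
    """Add start/end tokens, then pad to a fixed length."""
    seq = [SOS] + seq + [EOS]
    return seq + [PAD] * (fixed_len - len(seq))

def make_pair(rng, min_len=3, max_len=6, fixed_len=10):
    """Sample a source sequence X and derive the target sequence Y from it."""
    length = rng.randint(min_len, max_len)                    # random sequence length
    x = rng.choices(SOURCE_VOCAB, weights=WEIGHTS, k=length)  # unequal word probabilities
    y = [transform(t) for t in reversed(x)]                   # reverse, then map tokens
    return wrap(x, fixed_len), wrap(y, fixed_len)

rng = random.Random(0)
x, y = make_pair(rng)
print(x)
print(y)
```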

Tags: Transformer Self-Attention

Copyright © fadingcoder.top
