Understanding the Transformer Architecture and Key Components

Tech May 18 2

Overall Architecture of Transformers

Taking machine translation as an example, when we input a text sequence, the model outputs a corresponding translated sequence. The Transformer operates as a black box in this process.

When expanding the black box, we can see an encoder-decoder structure. The model first encodes the input text into a representation optimized for machine processing, then decodes this representation into the desired output format.

The architecture contains multiple stacked encoder and decoder layers. The encoding process is performed progressively through multiple encoder layers, similar to how humans process text through multiple cognitive steps. The decoding process follows the same layered approach.

Looking at the internal structure of encoder and decoder layers: an encoder layer first passes input through a self-attention layer, followed by a feed-forward layer. A decoder has a similar structure, with an additional encoder-decoder attention layer inserted between the self-attention and feed-forward layers.

Attention Mechanism Calculation

The core of the Transformer is its attention mechanism. To compute attention, we first generate Q (Query), K (Key), and V (Value) vectors.

In deep learning, especially NLP, Q, K, and V are three core concepts in the attention mechanism, widely used in Transformer models and now foundational in advanced NLP models like BERT and GPT.

  • Q (Query): Represents the information demand of the token currently being processed. It is compared against the keys of every token in the sequence (including its own) to determine how important each token is to the current one.
  • K (Key): Carries information for each token. It matches with queries to generate attention weights indicating how important the corresponding value is for the current query.
  • V (Value): Contains the actual information content of each token. After determining query-key matching degrees, these values are weighted and summed based on attention weights to produce the final attention output.

The workflow is: Query sends information demand → Query compares with all Keys to compute relevance (usually via dot product) → Assign weights based on relevance → Apply weights to corresponding Values and sum to get the final result.

Generating Q, K, V Vectors

Take input tokens "a" and "b" as an example. First, we apply embedding to convert them into token vectors v1 and v2.

Embedding is the process of converting discrete data (such as words or category labels) into continuous vectors, usually in a comparatively low-dimensional space. It's a key step in deep learning models.

  • Purpose: Dimensionality reduction (convert high-dimensional sparse data to low-dimensional dense vectors), representation learning (similar data are close in embedding space), and information transfer to models.
  • Process: Initialize a random weight matrix → Update vectors via backpropagation during training → Look up the corresponding vector for a given discrete item in deployment.
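The lookup step described above can be sketched in a few lines. This is only an illustration: the two-word vocabulary, the dimension `d_model = 4`, and the names `embedding_table` and `embed` are assumptions made for the example, and no training is performed.

```python
import random

# Hypothetical toy vocabulary and embedding size (not taken from the article).
vocab = {"a": 0, "b": 1}
d_model = 4

random.seed(0)
# One row per vocabulary item, randomly initialized.
# During training, backpropagation would update these rows.
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def embed(token):
    """Return the dense vector stored for a discrete token."""
    return embedding_table[vocab[token]]

v1, v2 = embed("a"), embed("b")
```

At deployment time the table is fixed, so embedding a token is just an index into it.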

We multiply these token vectors with learnable weight matrices W_Q, W_K, W_V to get the corresponding Q, K, V vectors for each token:

  • W_Q (Query Weight Matrix): Converts token vectors to query vectors for determining important information in context.
  • W_K (Key Weight Matrix): Converts token vectors to key vectors for matching with queries to compute attention weights.
  • W_V (Value Weight Matrix): Converts token vectors to value vectors containing actual information content for weighted summation.

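Operations: Q1 = v1 × W_Q, K1 = v1 × W_K, V1 = v1 × W_V, and likewise Q2 = v2 × W_Q, K2 = v2 × W_K, V2 = v2 × W_V (where × denotes matrix multiplication). The same three weight matrices are shared by all tokens.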
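As a small worked example, the projections can be computed by hand. The token vectors and weight matrices below are made-up numbers chosen so the arithmetic is easy to follow, and `matvec` is a helper defined only for this sketch.

```python
def matvec(v, W):
    """Multiply row vector v (length n) by matrix W (n x m), giving a vector of length m."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

# Hypothetical 3-dimensional token vectors and 3x2 weight matrices.
v1 = [1.0, 0.0, 2.0]
v2 = [0.0, 1.0, 1.0]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]

# Every token is projected with the same three matrices.
Q1, K1, V1 = matvec(v1, W_Q), matvec(v1, W_K), matvec(v1, W_V)
Q2, K2, V2 = matvec(v2, W_Q), matvec(v2, W_K), matvec(v2, W_V)

print(Q1, K1, V1)  # [3.0, 2.0] [2.0, 1.0] [3.0, 1.0]
print(Q2, K2, V2)  # [1.0, 2.0] [2.0, 0.0] [1.0, 2.0]
```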

Self-Attention Calculation Process

Take tokens "Thinking" and "Machines" as examples. After embedding, we get their token vectors, then multiply by the three weight matrices to get Q, K, V for each.

To compute self-attention for "Thinking", first calculate the Score (relevance) between its query vector and all key vectors in the sequence:

Score_Thinking,i = Q_Thinking · K_i (dot product of Q_Thinking and K_i)

The Score measures the correlation between the query token and other tokens. Higher scores mean stronger relevance. These scores are converted to a probability distribution via softmax to get attention weights.

Steps for "Thinking":

  1. Compute the Score against every token in the sequence, including itself: Score_Thinking,Thinking = Q_Thinking · K_Thinking and Score_Thinking,Machines = Q_Thinking · K_Machines
  2. Divide each Score by √d, where d is the dimension of the key vector (for example, d = 64 gives a divisor of 8)
  3. Apply softmax across the scaled Scores to get normalized attention weights
  4. Multiply each weight by the corresponding Value vector, then sum the results to get the self-attention value for "Thinking"

The same process applies to "Machines". For multiple tokens, matrix operations are used for efficiency.
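The four steps can be sketched for a two-token sequence as follows. This is a plain-Python illustration rather than an efficient implementation; the Q, K, V values are invented, and `self_attention` is a name chosen for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Q, K, V hold one row vector per token; returns one output vector per token."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Steps 1 and 2: dot product with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into attention weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Hypothetical vectors for two tokens (think "Thinking" and "Machines").
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Z = self_attention(Q, K, V)
```

Because each query here aligns best with its own key, each output lies closer to that token's own value vector than to the other one.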

Single-Head vs Multi-Head Attention

Single-head attention uses one set of Q, K, V. Multi-head attention uses multiple sets of Q, K, V generated by multiple groups of learnable weight matrices.

Each head computes attention independently, capturing different features or relationships in the input. The outputs of all heads are concatenated, then multiplied by a linear projection matrix W_O to produce the final multi-head attention output.

Concatenation merges information from the different attention heads, which increases the model's expressive power because each head learns in a different subspace. The projection W_O then maps the concatenated result back to the model dimension, keeping input dimensions consistent for subsequent layers.
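A minimal sketch of the concatenate-then-project flow, assuming two heads, a model dimension of 4, and randomly initialized weights (all hypothetical choices for illustration). The helper names `matmul`, `attention`, and `multi_head_attention` are defined only for this example.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over whole matrices."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs per token along the feature dimension.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    # Project back to the model dimension with W_O.
    return matmul(concat, W_O)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads = 4, 2            # hypothetical sizes
d_head = d_model // n_heads
X = rand_matrix(3, d_model, rng)   # 3 tokens
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_O = rand_matrix(n_heads * d_head, d_model, rng)
out = multi_head_attention(X, heads, W_O)
```

The output has one `d_model`-dimensional vector per token, the same shape as the input `X`.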

Token Vector Encoding

Transformer vs RNN: Position Information

Consider two examples: (1) Sentences with identical words but opposite meanings due to word order; (2) Sentences with changed word order but same meaning. This shows that word position is critical for semantic meaning.

Unlike RNNs, which process tokens one at a time and therefore capture position information inherently, Transformers process all tokens in parallel, and self-attention by itself does not take token order into account. Thus, we must inject position information into token vectors to differentiate the same token at different positions.

Final Encoding: Token Embedding + Position Encoding

The final token encoding is the sum of the token embedding and position encoding (PE).

Position encoding uses sine and cosine functions to inject position information. For a position pos and dimension index i (with total dimension d):

P_pos,2i = sin(pos / 10000^(2i/d))

P_pos,2i+1 = cos(pos / 10000^(2i/d))

Even dimensions use sine, odd dimensions use cosine. As the dimension index i increases, the wavelength of the sine/cosine functions grows geometrically (from 2π up to 10000 · 2π), allowing the model to capture both local and long-distance position relationships.

Example: For d=512, pos=1:

  • i=0: P_1,0 = sin(1), P_1,1 = cos(1)
  • i=1: P_1,2 = sin(1 / 10000^(2/512)), P_1,3 = cos(1 / 10000^(2/512))
  • Repeat until i=255 to get the full 512-dimensional position encoding vector.

The position encoding vector is added to the corresponding token embedding to form the final encoded vector.
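The formulas above translate directly into code. The sketch below follows them term by term; the function name `positional_encoding` and the all-zero stand-in embedding are assumptions made for the example.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal position encoding for one position; d is assumed to be even."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe[2 * i] = math.sin(angle)      # even dimensions use sine
        pe[2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe1 = positional_encoding(1, 512)

# The final encoding is the element-wise sum with the token embedding.
token_embedding = [0.0] * 512  # hypothetical stand-in for a learned embedding
encoded = [e + p for e, p in zip(token_embedding, pe1)]
```

For pos = 0 every sine term is 0 and every cosine term is 1, which gives a quick sanity check on an implementation.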

Masking in Transformers

PAD Mask

PAD Mask handles inconsistent sequence lengths in batches. Shorter sequences are padded with a special token to match the longest sequence in the batch. The PAD Mask is a binary vector where 1 (or True) marks padding positions, 0 (or False) marks actual data positions.

Example: Three sequences padded to length 4 (an empty string marks each padding position):

  • Sequence 1: ["我", "是", "", ""] → Mask: [0, 0, 1, 1]
  • Sequence 2: ["你", "是", "谁", ""] → Mask: [0, 0, 0, 1]
  • Sequence 3: ["他", "们", "是", "学生"] → Mask: [0, 0, 0, 0]

PAD Mask ensures padding positions are ignored in attention calculations and model training.
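A short sketch of padding a batch and deriving the mask. The token name `"<PAD>"` and the function `pad_batch` are assumptions for this example; the article does not specify the padding token.

```python
PAD = "<PAD>"  # hypothetical padding token name

def pad_batch(sequences, pad=PAD):
    """Pad every sequence to the longest length; mask uses 1 for padding, 0 for data."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad] * (max_len - len(s)) for s in sequences]
    mask = [[1 if tok == pad else 0 for tok in s] for s in padded]
    return padded, mask

batch = [["我", "是"], ["你", "是", "谁"], ["他", "们", "是", "学生"]]
padded, pad_mask = pad_batch(batch)
print(pad_mask)  # [[0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
```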

Upper Triangular Mask

Upper Triangular Mask maintains causal order in sequence generation tasks (e.g., language modeling). It masks future tokens so the model can only attend to current and past tokens when predicting the next token.

It is a square matrix whose side equals the sequence length, where all elements above the main diagonal are set to negative infinity (or a very large negative number), and elements on or below the diagonal are 0. The mask is added to the attention scores before softmax, so softmax assigns zero weight to the masked positions.

Example for sequence length 4:


0  -inf  -inf  -inf
0   0    -inf  -inf
0   0     0    -inf
0   0     0     0

This ensures when computing attention for position 1, only position 1 is visible; for position 2, positions 1 and 2 are visible, etc.
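To see the effect on the attention weights, the sketch below builds the additive mask and applies it to uniform scores. The all-zero raw scores are a hypothetical input chosen so the resulting weights are easy to read.

```python
import math

def causal_mask(n):
    """n x n additive mask: 0 on or below the diagonal, -inf above it."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]  # hypothetical uniform raw scores
mask = causal_mask(n)
masked = [[s + m for s, m in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print(row)
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# (third row: one third each on the first three positions, 0.0 on the last)
# [0.25, 0.25, 0.25, 0.25]
```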

Combining Masks

In practice, the PAD Mask and Upper Triangular Mask are combined by union: a position is masked if either mask marks it. This handles the padding and causal order constraints simultaneously.
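One way to form the union is with a boolean matrix in which True means "masked". This is a sketch under that convention; the function name `combined_mask` is chosen for the example, and frameworks differ in whether True marks masked or visible positions.

```python
def combined_mask(pad_mask_row):
    """pad_mask_row uses 1 for padding. Returns an n x n matrix where True means masked."""
    n = len(pad_mask_row)
    # Query position i may not see key position j if j is in the future or j is padding.
    return [[(j > i) or bool(pad_mask_row[j]) for j in range(n)] for i in range(n)]

m = combined_mask([0, 0, 1, 1])
for row in m:
    print(row)
# [False, True, True, True]
# [False, False, True, True]
# [False, False, True, True]
# [False, False, True, True]
```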

Full Computational Pipeline

Encoder Workflow

  1. Convert input tokens to embeddings, add position encoding to get position-aware vectors x1, x2, ...
  2. Pass through multi-head self-attention layer to get attention outputs z1, z2, ...
  3. Apply residual connection: add z to the original x, then apply layer normalization
  4. Pass the normalized result through a feed-forward network (fully connected layers)
  5. Apply another residual connection and layer normalization to get the encoder output
  6. Repeat this process through multiple stacked encoder layers

Key terms:

  • Residual Connection: Adds a sublayer's input directly to its output, which alleviates gradient vanishing and improves training efficiency
  • Layer Normalization: Normalizes activations within a layer to stabilize and accelerate training
  • Feed-Forward Network: Two fully connected layers with a ReLU activation in between, applied independently to each token
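Steps 3 to 5 of the encoder workflow can be sketched for a single token vector as follows. The attention output `z`, the weight values, and the tiny dimensions are all hypothetical, and the layer normalization here omits the learnable scale and shift parameters that real implementations usually include.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def feed_forward(x, W1, b1, W2, b2):
    """Two fully connected layers with ReLU in between, applied to one token."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * W2[i][j] for i, h in enumerate(hidden)) + b2[j]
            for j in range(len(b2))]

# Hypothetical values for one token.
x = [1.0, 2.0, 3.0, 4.0]    # position-aware input vector
z = [0.5, -0.5, 0.5, -0.5]  # stand-in for the self-attention output
W1 = [[1.0, -1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, -0.5]]
b2 = [0.0, 0.0, 0.0, 0.0]

y = add_and_norm(x, z)                                  # step 3
out = add_and_norm(y, feed_forward(y, W1, b1, W2, b2))  # steps 4 and 5
```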

Decoder Workflow

  1. Process target tokens similarly to encoder inputs (embedding + position encoding)
  2. Pass through masked multi-head self-attention layer (with upper triangular mask to prevent attending to future tokens)
  3. Apply residual connection and layer normalization
  4. Pass through encoder-decoder attention layer: uses the encoder outputs as K and V, and the output of the preceding decoder sublayer as Q
  5. Apply residual connection and layer normalization
  6. Pass through feed-forward network, then residual connection and layer normalization
  7. Repeat through multiple stacked decoder layers
  8. Pass final output through a linear layer and softmax to generate probability distribution over the target vocabulary
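The last step, turning a decoder output vector into a probability distribution, might look like the sketch below. The four-entry target vocabulary, the weights, and the decoder output `h` are invented for illustration, and picking the most probable token (greedy decoding) is just one possible selection strategy.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(h, W_out, b_out):
    """Linear layer followed by softmax over the target vocabulary."""
    logits = [sum(hi * W_out[i][j] for i, hi in enumerate(h)) + b_out[j]
              for j in range(len(b_out))]
    return softmax(logits)

target_vocab = ["<SOS>", "<EOS>", "A", "B"]  # hypothetical vocabulary
h = [1.0, -1.0]                              # hypothetical final decoder output
W_out = [[0.0, 0.0, 2.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
b_out = [0.0, 0.0, 0.0, 0.0]

probs = output_distribution(h, W_out, b_out)  # logits are [0, -1, 2, 0]
predicted = target_vocab[probs.index(max(probs))]
print(predicted)  # A
```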

Data Generation Strategy for Translation Tasks

For translation experiments, we generate source (X) and target (Y) sequences:

  • Source Language (X): Has a vocabulary of 7 words. Randomly sample sequences with different probabilities for each word, and random sequence lengths.
  • Target Language (Y): Generated from X by reversing the sequence, capitalizing letters, subtracting digits from 10, and making the first Y token depend on the last X token.
  • Add start/end tokens, then pad sequences to a fixed length with padding tokens.
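A rough sketch of such a generator is shown below. Several details are assumptions, since the article does not spell them out: the 7-word vocabulary, the sampling weights, the length range, the special-token names, and in particular the reading of "the first Y token depends on the last X token" as repeating the first Y token (which, after reversal, is derived from the last X token).

```python
import random

SOS, EOS, PAD = "<SOS>", "<EOS>", "<PAD>"           # hypothetical special tokens
SOURCE_VOCAB = ["a", "b", "c", "1", "2", "3", "4"]  # hypothetical 7-word vocabulary
WEIGHTS = [1, 2, 3, 4, 5, 6, 7]                     # hypothetical unequal sampling weights

def transform(token):
    """Letters are capitalized; a digit d becomes 10 - d."""
    return str(10 - int(token)) if token.isdigit() else token.upper()

def make_pair(rng, min_len=3, max_len=6, fixed_len=10):
    """Return one (X, Y) pair, each wrapped in start/end tokens and padded."""
    x = rng.choices(SOURCE_VOCAB, weights=WEIGHTS, k=rng.randint(min_len, max_len))
    y = [transform(t) for t in reversed(x)]
    y = [y[0]] + y  # assumption: repeat the first Y token (derived from the last X token)

    def finish(seq):
        seq = [SOS] + seq + [EOS]
        return seq + [PAD] * (fixed_len - len(seq))

    return finish(x), finish(y)

rng = random.Random(0)
x, y = make_pair(rng)
```

Because Y is a deterministic function of X, a model that learns the task should be able to reach near-perfect accuracy, which makes a toy dataset like this convenient for checking an implementation.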
