Fading Coder

One Final Commit for the Last Sprint


Token Embeddings and Sinusoidal Positional Encoding in Transformer Architectures


Token Embeddings

Token embedding is the process of representing discrete units of text, such as words or subwords, as continuous high-dimensional vectors. Since neural networks perform mathematical operations on numerical data, raw text must be converted into a format that captures semantic relationships. In a Transformer model, a vocabulary is defined where each unique token is assigned a specific index.

For instance, given a vocabulary containing indices for ["The", "cat", "sits", "on", "the", "mat"], a sentence like "The cat sits" might be converted into a sequence of integers: [0, 1, 2]. Each integer index is then used to look up a corresponding row in an embedding matrix of size $V \times d_{model}$, where $V$ is the vocabulary size and $d_{model}$ is the hidden dimension (e.g., 512).

Implementation with PyTorch

The nn.Embedding module serves as a lookup table where weights are learned during the training phase.

import torch
import torch.nn as nn

# Configuration parameters
vocab_size = 10
feature_dim = 128

# Initialize the embedding layer
word_lookup = nn.Embedding(vocab_size, feature_dim)

# Sample input sequence (sequence length = 4)
input_tokens = torch.tensor([2, 5, 0, 9], dtype=torch.long)

# Generate dense vectors
embedded_vectors = word_lookup(input_tokens)

print(f"Input Shape: {input_tokens.shape}")
print(f"Output Shape: {embedded_vectors.shape}")
print(f"Vector for index 2: {embedded_vectors[0][:5]}...")
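Because nn.Embedding is a pure lookup table, each output row can be verified against the corresponding row of the layer's weight matrix. A minimal sanity check (variable names here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # for reproducibility

emb = nn.Embedding(10, 128)
tokens = torch.tensor([2, 5, 0, 9], dtype=torch.long)
out = emb(tokens)

# Each output row is exactly the corresponding row of the weight matrix
assert torch.equal(out[0], emb.weight[2])
assert torch.equal(out[3], emb.weight[9])
assert out.shape == (4, 128)
```

During training, gradients flow back into emb.weight, so these rows are updated like any other parameter.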

Positional Encoding

Transformers process entire sequences simultaneously rather than sequentially. While this parallelism improves efficiency, the model inherently lacks information regarding the order of tokens. To compensate for this, positional encodings are added to the token embeddings. These encodings provide the model with a sense of where each word appears in the sequence.

The standard Transformer uses fixed sinusoidal functions to generate unique patterns for each position:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, $pos$ represents the position in the sequence, and $i$ represents the dimension index. This approach allows the model to learn relative positions since $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
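The linear-function property can be checked numerically: for a fixed frequency $\omega = 10000^{-2i/d_{model}}$, the pair $(\sin(\omega(pos+k)), \cos(\omega(pos+k)))$ is a rotation of $(\sin(\omega \cdot pos), \cos(\omega \cdot pos))$ by the angle $k\omega$, which depends only on the offset $k$. A small sketch with illustrative values:

```python
import math

d_model = 8
pos, k, i = 3, 5, 1  # position, offset, and dimension pair (illustrative)
omega = 1.0 / (10000 ** (2 * i / d_model))

# (sin, cos) pair at position pos
s, c = math.sin(pos * omega), math.cos(pos * omega)

# Pair at position pos + k, computed directly...
s_direct = math.sin((pos + k) * omega)
c_direct = math.cos((pos + k) * omega)

# ...and via a fixed linear map (rotation by k * omega) of the pair at pos
s_lin = s * math.cos(k * omega) + c * math.sin(k * omega)
c_lin = c * math.cos(k * omega) - s * math.sin(k * omega)

assert abs(s_direct - s_lin) < 1e-12
assert abs(c_direct - c_lin) < 1e-12
```

Since the rotation matrix is independent of $pos$, attention layers can in principle learn to attend by relative offset.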

Positional Encoding Implementation

Computing the division term in log space is a common technique that improves numerical stability by avoiding direct exponentiation of large powers.

import torch
import torch.nn as nn
import math

class FixedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Initialize matrix of zeros
        pe = torch.zeros(max_len, d_model)
        
        # Vector of positions
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Compute the division term in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register pe as a buffer (not a trainable parameter)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x shape: [batch_size, seq_len, d_model]
        return x + self.pe[:, :x.size(1)]
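The buffer built in __init__ above can be spot-checked against the closed-form sinusoid definitions. This standalone sketch recomputes the same pe matrix outside the class and compares a few entries (dimension sizes are illustrative):

```python
import math
import torch

d_model, max_len = 16, 50
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Spot-check entries against PE(pos, 2i) and PE(pos, 2i+1) with pos=3, 2i=4
assert abs(pe[3, 4].item() - math.sin(3 / 10000 ** (4 / d_model))) < 1e-6
assert abs(pe[3, 5].item() - math.cos(3 / 10000 ** (4 / d_model))) < 1e-6

# Position 0 encodes as (0, 1, 0, 1, ...)
assert torch.allclose(pe[0, 0::2], torch.zeros(d_model // 2))
assert torch.allclose(pe[0, 1::2], torch.ones(d_model // 2))
```

Registering pe with register_buffer ensures it moves with the module across devices (e.g., .to("cuda")) and is saved in the state dict, without being updated by the optimizer.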

Integrating Embeddings and Position

The final input to the Transformer encoder or decoder is the element-wise sum of the token embedding and the positional encoding. Dropout is typically applied to this sum for regularization.

class TransformerInputLayer(nn.Module):
    def __init__(self, v_size, d_model, max_sequence, dropout_rate=0.1):
        super().__init__()
        self.token_mapping = nn.Embedding(v_size, d_model)
        self.pos_mapping = FixedPositionalEncoding(d_model, max_sequence)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.scaling = math.sqrt(d_model)

    def forward(self, indices):
        # Map tokens and scale by sqrt of d_model
        x = self.token_mapping(indices) * self.scaling
        # Inject positional information
        x = self.pos_mapping(x)
        return self.dropout(x)
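The same forward pass can be wired inline to make the shapes explicit. This is a sketch under the assumption of a batch containing one sequence; a zero tensor stands in for the sinusoidal buffer to keep the snippet self-contained:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
v_size, d_model, seq_len = 10, 16, 4

embedding = nn.Embedding(v_size, d_model)
dropout = nn.Dropout(p=0.1)

# Stand-in positional buffer; in practice this is the sinusoidal encoding,
# but any [1, seq_len, d_model] tensor shows the wiring
pe = torch.zeros(1, seq_len, d_model)

tokens = torch.tensor([[2, 5, 0, 9]])       # batch of one sequence
x = embedding(tokens) * math.sqrt(d_model)  # scale by sqrt(d_model)
x = x + pe                                  # element-wise sum with positions
x = dropout(x)                              # regularize the combined input

assert x.shape == (1, 4, 16)
```

The sqrt(d_model) scaling follows the original Transformer paper: it keeps the magnitude of the learned embeddings comparable to that of the fixed positional encodings before the two are summed.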

