Token Embeddings and Sinusoidal Positional Encoding in Transformer Architectures
Token Embeddings
Token embedding is the process of representing discrete units of text, such as words or subwords, as continuous high-dimensional vectors. Since neural networks perform mathematical operations on numerical data, raw text must be converted into a format that captures semantic relationships. In a Transformer model, a vocabulary is defined where each unique token is assigned a specific index.
For instance, given a vocabulary containing indices for ["The", "cat", "sits", "on", "the", "mat"], a sentence like "The cat sits" might be converted into the sequence of integers [0, 1, 2]. Each integer index is then used to look up a corresponding row in an embedding matrix of size $V \times d_{model}$, where $V$ is the vocabulary size and $d_{model}$ is the hidden dimension (e.g., 512).
Implementation with PyTorch
The nn.Embedding module serves as a lookup table where weights are learned during the training phase.
import torch
import torch.nn as nn
# Configuration parameters
vocab_size = 10
feature_dim = 128
# Initialize the embedding layer
word_lookup = nn.Embedding(vocab_size, feature_dim)
# Sample input sequence of 4 token indices (no batch dimension, sequence length = 4)
input_tokens = torch.tensor([2, 5, 0, 9], dtype=torch.long)
# Generate dense vectors
embedded_vectors = word_lookup(input_tokens)
print(f"Input Shape: {input_tokens.shape}")
print(f"Output Shape: {embedded_vectors.shape}")
print(f"Vector for index 2: {embedded_vectors[0][:5]}...")
Positional Encoding
Transformers process entire sequences in parallel rather than sequentially. While this parallelism improves efficiency, the model inherently lacks information about the order of tokens. To compensate, positional encodings are added to the token embeddings, giving the model a sense of where each token appears in the sequence.
The standard Transformer uses fixed sinusoidal functions to generate unique patterns for each position:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Here, $pos$ represents the position in the sequence, and $i$ represents the dimension index. This approach allows the model to attend to relative positions easily, since for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
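To make this relative-position property concrete, consider a single frequency $\omega_i = 1/10000^{2i/d_{model}}$. The standard angle-addition identities give, for any offset $k$:
$$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}$$
The transformation is a rotation that depends only on the offset $k$, not on the absolute position, which is what makes relative offsets easy to represent.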
Positional Encoding Implementation
Computing the denominator term in log space is a common trick to maintain numerical stability when raising the large base to high powers.
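Concretely, the div_term line in the code below relies on the identity
$$\frac{1}{10000^{2i/d_{model}}} = \exp\left(-\frac{2i}{d_{model}} \ln 10000\right)$$
so the base 10000 is never exponentiated directly.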
import torch
import torch.nn as nn
import math
class FixedPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
# Initialize matrix of zeros
pe = torch.zeros(max_len, d_model)
# Vector of positions
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
# Compute the division term in log space
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
# Apply sine to even indices and cosine to odd indices
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
# Register pe as a buffer (not a trainable parameter)
self.register_buffer('pe', pe.unsqueeze(0))
def forward(self, x):
# x shape: [batch_size, seq_len, d_model]
return x + self.pe[:, :x.size(1)]
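A minimal usage sketch, assuming hypothetical dimensions (d_model = 128, a batch of 2 sequences of length 10), shows how the module is applied to a batch of embeddings:
# Apply the fixed encoding to a dummy batch of embeddings
pos_enc = FixedPositionalEncoding(d_model=128, max_len=50)
dummy_embeddings = torch.zeros(2, 10, 128)  # [batch_size, seq_len, d_model]
encoded = pos_enc(dummy_embeddings)
print(encoded.shape)  # expected: torch.Size([2, 10, 128])
With zero-valued embeddings, the output is simply the raw sinusoidal pattern, which can be useful for visualizing the encoding.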
Integrating Embeddings and Position
The final input to the Transformer encoder or decoder is the element-wise sum of the token embedding and the positional encoding. Following the original Transformer paper, the token embeddings are first scaled by $\sqrt{d_{model}}$, and dropout is typically applied to the sum for regularization.
class TransformerInputLayer(nn.Module):
def __init__(self, v_size, d_model, max_sequence, dropout_rate=0.1):
super().__init__()
self.token_mapping = nn.Embedding(v_size, d_model)
self.pos_mapping = FixedPositionalEncoding(d_model, max_sequence)
self.dropout = nn.Dropout(p=dropout_rate)
self.scaling = math.sqrt(d_model)
def forward(self, indices):
# Map tokens and scale by sqrt of d_model
x = self.token_mapping(indices) * self.scaling
# Inject positional information
x = self.pos_mapping(x)
return self.dropout(x)
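As a usage sketch under assumed hyperparameters (a vocabulary of 10 tokens and a model dimension of 128), the layer maps a batch of token indices to position-aware embeddings:
# Map a batch of token indices to scaled, position-aware embeddings
input_layer = TransformerInputLayer(v_size=10, d_model=128, max_sequence=50)
token_batch = torch.tensor([[2, 5, 0, 9]], dtype=torch.long)  # [batch_size=1, seq_len=4]
output = input_layer(token_batch)
print(output.shape)  # expected: torch.Size([1, 4, 128])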