Graph Attention Networks: Adaptive Feature Aggregation via Self-Attention

Graph Attention Networks (GATs) introduce attention mechanisms into graph neural architectures to enable adaptive weighting of neighboring nodes. Proposed by Veličković et al. in 2017, the architecture computes importance coefficients between connected nodes through masked self-attention layers, allowing the model to focus on the most relevant parts of each node's neighborhood without requiring costly matrix operations such as inversion or upfront knowledge of the global graph structure.

The attention mechanism adapts the concept originally developed for sequence modeling in neural machine translation. Unlike Graph Convolutional Networks that rely on fixed normalization constants based on node degree, GATs dynamically compute attention coefficients by comparing node features across edges. This produces learnable weights that indicate which neighbors contribute most significantly to a node's updated representation.
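
For contrast, a GCN-style weight for an edge (i, j) is a constant fixed by node degrees alone. A hand-computed example (the degrees are illustrative):

import math

# GCN symmetric normalization: neighbor j's weight in node i's update is
# 1 / sqrt(deg(i) * deg(j)), set by the graph structure and never learned.
deg_i, deg_j = 4, 2
w_ij = 1 / math.sqrt(deg_i * deg_j)
print(w_ij)  # 0.3535..., identical on every forward pass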

A single GAT layer first applies a shared linear transformation to every node's features, then scores each edge with a learnable attention function. For nodes i and j connected by an edge, the attention coefficient e_ij represents the importance of node j's features to node i. These coefficients are normalized across each node's neighborhood with a softmax, and the resulting weights are used to compute a weighted combination of the transformed neighbor features. Multi-head attention extends this by performing K independent attention computations in parallel and concatenating or averaging their outputs.
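
Before moving to a full model, the core computation can be sketched in plain PyTorch. This is a minimal single-head illustration with toy dimensions and a hand-written adjacency matrix, not an optimized implementation; the GATConv layer used below performs the equivalent computation with sparse message passing:

import torch
import torch.nn.functional as F

num_nodes, in_dim, out_dim = 3, 4, 2
h = torch.randn(num_nodes, in_dim)     # input node features
W = torch.randn(in_dim, out_dim)       # shared linear transformation
a = torch.randn(2 * out_dim)           # learnable attention vector

Wh = h @ W                             # transform every node's features

# e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every ordered pair (i, j)
pairs = torch.cat([Wh.repeat_interleave(num_nodes, dim=0),
                   Wh.repeat(num_nodes, 1)], dim=1)
e = F.leaky_relu(pairs @ a, negative_slope=0.2).view(num_nodes, num_nodes)

# Masked softmax: each node attends only to its neighbors (self-loops kept)
adj = torch.tensor([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=torch.bool)
alpha = torch.softmax(e.masked_fill(~adj, float('-inf')), dim=1)

h_new = alpha @ Wh                     # attention-weighted aggregation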

An implementation using PyTorch Geometric demonstrates this architecture for node classification on the Cora citation dataset:

import torch
import torch.nn.functional as F
from torch.nn import Module
from torch_geometric.nn import GATConv
from torch_geometric.datasets import Planetoid

class AttnGraphNet(Module):
    def __init__(self, feat_dim, hidden_dim, class_dim, attn_heads=4):
        super().__init__()
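        # First attention layer: attn_heads heads run in parallel and their
        # outputs are concatenated, widening features to hidden_dim * attn_heads.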
        self.layer1 = GATConv(feat_dim, hidden_dim, heads=attn_heads, dropout=0.6)
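        # Output layer: a single head mapping to class scores; in GATConv the
        # dropout argument is applied to the normalized attention coefficients.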
        self.layer2 = GATConv(hidden_dim * attn_heads, class_dim, heads=1, 
                              concat=False, dropout=0.6)
        
    def forward(self, node_feats, edge_list):
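        # Multi-head attention over each node's 1-hop neighborhood, followed
        # by a non-linearity and dropout for regularization.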
        h = self.layer1(node_feats, edge_list)
        h = F.elu(h)
        h = F.dropout(h, p=0.6, training=self.training)
        h = self.layer2(h, edge_list)
        return F.log_softmax(h, dim=1)

def optimize(model, data, optimizer):
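    # One training step: forward pass, negative log-likelihood loss on the
    # training nodes only, then backpropagation and a parameter update.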
    model.train()
    optimizer.zero_grad()
    predictions = model(data.x, data.edge_index)
    loss = F.nll_loss(predictions[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

def evaluate(model, data):
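    # Report accuracy on the train/validation/test splits defined by the
    # dataset's boolean masks, with gradient tracking disabled.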
    model.eval()
    with torch.no_grad():
        logits = model(data.x, data.edge_index)
        predictions = logits.argmax(dim=1)
        train_acc = (predictions[data.train_mask] == data.y[data.train_mask]).sum().item() / data.train_mask.sum().item()
        val_acc = (predictions[data.val_mask] == data.y[data.val_mask]).sum().item() / data.val_mask.sum().item()
        test_acc = (predictions[data.test_mask] == data.y[data.test_mask]).sum().item() / data.test_mask.sum().item()
    return train_acc, val_acc, test_acc

# Configuration
hidden_units = 16
attention_heads = 8
learning_rate = 0.005
weight_decay = 5e-4
epochs = 300

# Data preparation
dataset = Planetoid(root='/tmp/Cora', name='Cora')
graph = dataset[0]

# Model initialization
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
classifier = AttnGraphNet(dataset.num_features, hidden_units, 
                         dataset.num_classes, attention_heads).to(device)
graph = graph.to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate, 
                            weight_decay=weight_decay)

# Training procedure
for iteration in range(1, epochs + 1):
    loss = optimize(classifier, graph, optimizer)
    if iteration % 50 == 0:
        train_acc, val_acc, test_acc = evaluate(classifier, graph)
        print(f'Iter: {iteration:03d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')

This implementation uses multi-head attention to stabilize the learning process and increase model capacity. Because the attention mechanism assigns a different importance to each neighbor, the architecture is particularly effective for graphs with heterogeneous neighborhood structures or widely varying node degrees.
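
To make the head-combination behavior concrete, a quick shape check with illustrative dimensions shows concatenation versus averaging in GATConv:

import torch
from torch_geometric.nn import GATConv

x = torch.randn(4, 8)                      # 4 nodes, 8 features each
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 0, 3, 2]])  # directed edge list

concat_layer = GATConv(8, 16, heads=4)                  # concatenate heads
average_layer = GATConv(8, 16, heads=4, concat=False)   # average heads

print(concat_layer(x, edge_index).shape)   # torch.Size([4, 64])
print(average_layer(x, edge_index).shape)  # torch.Size([4, 16])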
